Abstract

The problem of graphical model selection is to correctly estimate the graph structure of a
Markov random field given samples from the underlying distribution. We analyze the
information-theoretic limitations of the problem of graph selection for binary Markov random
fields under high-dimensional scaling, in which the graph size $p$ and the number of edges $k$,
and/or the maximal node degree $d$, are allowed to increase to infinity as a function of the
sample size $n$. For pairwise binary Markov random fields, we derive both necessary and
sufficient conditions for correct graph selection over the class $\mathcal{G}_{p,k}$ of graphs on
$p$ vertices with at most $k$ edges, and over the class $\mathcal{G}_{p,d}$ of graphs on $p$
vertices with maximum degree at most $d$. For the class $\mathcal{G}_{p,k}$, we establish the
existence of constants $c$ and $c'$ such that if $n < c\,k \log p$, any method has error
probability at least 1/2 uniformly over the family, and we demonstrate a graph decoder that
succeeds with high probability uniformly over the family for sample sizes $n > c' k^2 \log p$.
Similarly, for the class $\mathcal{G}_{p,d}$, we exhibit constants $c$ and $c'$ such that for
$n < c\,d^2 \log p$, any method fails with probability at least 1/2, and we demonstrate a graph
decoder that succeeds with high probability for $n > c' d^3 \log p$.

1 Introduction 

Markov random fields (also known as undirected graphical models) provide a structured
representation of the joint distributions of families of random variables. They are used in
various application domains, among them image analysis [14], [5], social network analysis
[27], [29], and computational biology \12 \ 120 1 [T]. Any Markov random field is associated with
an underlying graph that describes conditional independence properties associated with the
joint distribution of the random variables. The problem of graphical model selection is to
recover this unknown graph using samples from the distribution.

Given its relevance in many domains, the graph selection problem has attracted a great deal
of attention. The naive approach of searching exhaustively over the space of all graphs is
computationally intractable, since there are $2^{\binom{p}{2}}$ distinct graphs over $p$
vertices. If the underlying graph is known to be tree-structured, then the graph selection
problem can be reduced to a maximum-weight spanning tree problem and solved in polynomial
time [9]. On the other hand, for general graphs with cycles, the problem is known to be
difficult in a complexity-theoretic sense [8]. Nonetheless, a variety of methods have been
proposed, including constraint-based approaches [26], [20], thresholding methods [6], and
$\ell_1$-based relaxations [21], [22], [32], IHl [Mj. Other researchers [El E] have analyzed
graph selection methods based on penalized forms of pseudolikelihood.

Given a particular procedure for graph selection, a classical analysis studies its behavior for
a fixed graph as the sample size $n$ is increased. In this paper, as with an evolving line of
contemporary statistical research, we address the graph selection problem in the
high-dimensional setting, meaning that we allow the graph size $p$ as well as other structural
parameters, such as the number of edges $k$ or the maximum vertex degree $d$, to scale with the
sample size $n$. We note that a line of recent work has established some high-dimensional
consistency results for various graph selection procedures, including methods based on
$\ell_1$-regularization for Gaussian models [21], [23], [24], $\ell_1$-regularization for binary
discrete Markov random fields [22], thresholding methods for discrete models [6], and variants
of the PC algorithm for directed graphical models [2^. All of these methods are practically
appealing given their low computational cost.

Of complementary interest, and the focus of the paper, are the information-theoretic
limitations of graphical model selection. More concretely, consider a graph $G = (V, E)$,
consisting of a vertex set $V$ with cardinality $p$, and an edge set $E \subseteq V \times V$.
In this paper, we consider both the class $\mathcal{G}_{p,k}$ of all graphs with $|E| \le k$
edges, as well as the class $\mathcal{G}_{p,d}$ of all graphs with maximum vertex degree at most
$d$. Now suppose that we are allowed to collect $n$ independent and identically distributed
(i.i.d.) samples from a Markov random field defined by some graph $G \in \mathcal{G}_{p,k}$ (or
$\mathcal{G}_{p,d}$). Remembering that the graph size $p$ and structural parameters $(k, d)$ are
allowed to scale with the sample size, we thereby obtain sequences of statistical inference
problems, indexed by the triplet $(n, p, k)$ for the class $\mathcal{G}_{p,k}$, and by the triplet
$(n, p, d)$ for the class $\mathcal{G}_{p,d}$. The goal of this paper is to address questions of
the following type. First, under what scalings of the triplet $(n, p, k)$ (or correspondingly,
the triplet $(n, p, d)$) is it possible to recover the correct graph with high probability?
Conversely, under what scalings of these triplets does any method fail most of the time?

Although our methods are somewhat more generally applicable, so as to bring sharp focus to
these issues, we limit the analysis of this paper to the case of pairwise binary Markov random
fields, also known as the Ising model. The Ising model is a classical model from statistical
physics [18^ H], where it is used to model physical phenomena such as crystal structure and
magnetism; more recently it has been used in image analysis [5], [13], social network modeling
[3], [27], and gene network analysis [U [25].

At a high level, then, the goal of this paper is to understand the information-theoretic
capacity of Ising model selection. Our perspective is not unrelated to a line of statistical
work in non-parametric estimation [15], [17], [31], [30], in that we view the observation process
as a channel communicating information about graphs to the statistician. In contrast to
non-parametric estimation, the spaces of possible "codewords" are not function spaces but
rather classes of graphs. Accordingly, part of the analysis in this paper involves developing
ways in which to measure distances between graphs, and to relate these distances to the
Kullback-Leibler divergence known to control error rates in statistical testing. We note that
an understanding of the graph selection capacity can be practically useful in two different
ways. First, it can clarify when computationally efficient algorithms achieve
information-theoretic limits, and hence are optimal up to constant factors. Second, it can
reveal regimes in which the best known methods to date are sub-optimal, thereby motivating the
search for new and possibly better methods. Indeed, the analysis of this paper has consequences
of both types.

In this paper, we prove four main theorems, more specifically necessary and sufficient
conditions for the class $\mathcal{G}_{p,k}$ of bounded edge cardinality models, and for the class
$\mathcal{G}_{p,d}$ of bounded vertex degree models. Proofs of the necessary conditions
(Theorems 1 and 2) use indirect methods, based on a version of Fano's lemma applied to carefully
constructed sub-families of graphs. On the other hand, our proof of the sufficient conditions
(Theorems 3 and 4) is based on direct analysis of explicit "graph decoders". The remainder of
this paper is organized as follows. We begin in Section 2 with background on Markov random
fields, the classes of graphs considered in this paper, and a precise statement of the
graphical model selection problem. In Section 3, we state our main results and explore some of
their consequences. Section 4 is devoted to proofs of the necessary conditions on the sample
size (Theorems 1 and 2), whereas Section 5 is devoted to proofs of the sufficient conditions.
We conclude with a discussion in Section 6.

In this paper, we assume that the data is drawn from some Ising model from the class
$\mathcal{G}_{p,k}$ or $\mathcal{G}_{p,d}$, so that we study the probability of recovering the
exact model. However, similar analysis can be applied to the problem of finding the best
approximating distribution using an Ising model from class $\mathcal{G}_{p,d}$ or
$\mathcal{G}_{p,k}$.

Notation: For the convenience of the reader, we summarize here notation to be used throughout
the paper. We use the following standard notation for asymptotics: we write $f(n) = O(g(n))$ if
$f(n) \le c\,g(n)$ for some constant $c < \infty$, and $f(n) = \Omega(g(n))$ if
$f(n) \ge c'\,g(n)$ for some constant $c' > 0$. The notation $f(n) = \Theta(g(n))$ means that
$f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.



2 Background and problem formulation 

We begin with some background on Markov random fields, and then provide a precise formulation 
of the problem. 



2.1 Markov random fields and Ising models 

An undirected graph $G = (V, E)$ consists of a collection $V = \{1, 2, \ldots, p\}$ of vertices
joined by a collection $E \subseteq V \times V$ of edges. The neighborhood of any node $s \in V$
is the subset $\mathcal{N}(s) \subseteq V$

$$\mathcal{N}(s) := \{\, t \in V \mid (s,t) \in E \,\}, \tag{1}$$

and the degree of vertex $s$ is given by $d_s := |\mathcal{N}(s)|$, corresponding to the
cardinality of this neighbor set. We use $d = \max_{s \in V} d_s$ to denote the maximum vertex
degree, and $k = |E|$ to denote the total number of edges.

A Markov random field is obtained by associating a random variable $X_s$ to each vertex
$s \in V$, and then specifying a joint distribution $\mathbb{P}$ over the random vector
$(X_1, \ldots, X_p)$ that respects the graph structure in a particular way. In the special case
of the Ising model, each random variable $X_s$ takes values in $\{-1, +1\}$, and the
probability mass function has the form

$$\mathbb{P}_\theta(x_1, \ldots, x_p) = \exp\Big\{ \sum_{(s,t) \in E} \theta_{st} x_s x_t - Z(\theta) \Big\}, \tag{2}$$

where $Z(\theta)$ is the log normalization factor given by

$$Z(\theta) := \log \Big[ \sum_{x \in \{-1,+1\}^p} \exp\Big\{ \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\} \Big]. \tag{3}$$

To be clear, we view the parameter vector $\theta$ as an element of $\mathbb{R}^{\binom{p}{2}}$
with the understanding that $\theta_{st} = 0$ for all pairs $(s,t) \notin E$. So as to emphasize
the graph-structured nature of the parameter $\theta$, we often use the notation $\theta(G)$.
The edge weight $\theta_{st}$ describes the conditional dependence between $X_s$ and $X_t$,
given fixed values for all vertices $X_u$, $u \ne s, t$. In particular, a little calculation
shows that the conditional distribution takes the form

$$\mathbb{P}_\theta\big(x_s, x_t \mid x_{V \setminus \{s,t\}}\big) \propto \exp\Big( \theta_{st} x_s x_t + \sum_{u \in \mathcal{N}(s) \setminus t} \theta_{us} x_u x_s + \sum_{u \in \mathcal{N}(t) \setminus s} \theta_{ut} x_u x_t \Big).$$

The Ising model (2) has its origins in statistical physics [181 H!, where it is used to model
physical phenomena such as crystal structure and magnetism; it has also been used as a simple
model in image processing [5], [13], gene network analysis [U [25], and in modeling social
networks [3], [27]. For instance, Banerjee et al. [3] use this model to describe the voting
behaviors of $p$ politicians, where $X_s$ represents whether politician $s$ voted for
($X_s = +1$) or against ($X_s = -1$) a particular bill. In this case, a positive edge weight
$\theta_{st} > 0$ would mean that, conditioned on the other politicians' votes, politicians $s$
and $t$ are more likely to agree in their voting (i.e., $X_s = X_t$) than to disagree
($X_s \ne X_t$), whereas a negative edge weight means that they are more likely to disagree.
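
As a small numerical illustration of the model (2), the following Python sketch enumerates all
$2^p$ configurations of a toy Ising model and checks the qualitative claim about agreement and
disagreement. The function names `ising_pmf` and `prob_agree`, the 0-based vertex labels, and the
three-vertex chain with weights $\pm 0.8$ are assumptions made for this example only, and the
quantity computed is the marginal probability of agreement. Brute-force enumeration is feasible
only for very small $p$.

```python
import itertools
import math


def ising_pmf(p, theta):
    """Return {x: P_theta(x)} for all x in {-1,+1}^p.

    theta maps an edge (s, t) to its weight; pairs that are absent have weight 0.
    """
    configs = list(itertools.product([-1, +1], repeat=p))
    weights = [math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
               for x in configs]
    total = sum(weights)  # equals exp(Z(theta)), with Z the log-sum in (3)
    return {x: w / total for x, w in zip(configs, weights)}


def prob_agree(pmf, s, t):
    """Marginal probability that X_s = X_t under the given pmf."""
    return sum(q for x, q in pmf.items() if x[s] == x[t])


# Hypothetical chain 0 - 1 - 2 with one positive and one negative edge weight.
pmf = ising_pmf(3, {(0, 1): 0.8, (1, 2): -0.8})
print(prob_agree(pmf, 0, 1))  # larger than 1/2: a positive weight favours agreement
print(prob_agree(pmf, 1, 2))  # smaller than 1/2: a negative weight favours disagreement
```

On a chain, the agreement probability across an edge with weight $\theta$ reduces to
$e^{\theta} / (2 \cosh \theta)$, which gives a quick sanity check on the enumeration.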



In this paper, we forbid self-loops in the graph, meaning that $(s, s) \notin E$ for all $s \in V$.

2.2 Classes of graphical models

In this paper, we consider two different classes of Ising models (2), depending on the condition
that we impose on the edge set $E$. In particular, we consider the two classes of graphs:

(a) the collection $\mathcal{G}_{p,d}$ of graphs such that each vertex has degree at most $d$ for
some $d \ge 1$, and

(b) the collection $\mathcal{G}_{p,k}$ of graphs $G$ with $|E| \le k$ edges for some $k \ge 1$.

In addition to the structural properties of the graphs, the difficulty of graph selection also
depends on properties of the vector of edge weights $\theta(G) \in \mathbb{R}^{\binom{p}{2}}$.
Naturally, one important property is the minimum value over the edges. Accordingly, we define
the function

$$\lambda^*(\theta(G)) := \min_{(s,t) \in E} |\theta_{st}|. \tag{4}$$

The interpretation of the parameter $\lambda$ is clear: as in any signal detection problem, it
is obviously difficult to detect an interaction $\theta_{st}$ if it is extremely close to zero.

In contrast to classical signal detection problems, estimation of the graphical structure turns
out to be harder if the edge parameters $\theta_{st}$ are large, since large values of the edge
parameters can mask the presence of interactions on other edges. The following example
illustrates this point:

Example 1. Consider the graphs in the family $\mathcal{G}_{p,k}$ on $p = 3$ vertices with exactly
$k = 2$ edges; note that there are a total of 3 such graphs. For each of these three graphs,
consider the parameter vector

$$\theta(G) = [\theta \;\; \theta \;\; 0],$$

where the single zero corresponds to the single distinct pair $s \ne t$ not in the graph's edge
set, as illustrated in Figure 1. In the limiting case $\theta = +\infty$, for any choice of graph
with two edges, the Ising model distribution enforces the "hard-core" constraint that
$(X_1, X_2, X_3)$ must all be equal; that is, for any graph $G$, the distribution
$\mathbb{P}_{\theta(G)}$ places mass 1/2 on the configuration $[+1 \; +1 \; +1]$ and mass 1/2 on
the configuration $[-1 \; -1 \; -1]$. Of course, this hard-core limit is an extreme case, in which
the models are not actually identifiable. Nonetheless, it shows that if the edge weight $\theta$
is finite but very large, the models will not be identical, but will be extremely hard to
distinguish.

Figure 1 (panels (a), (b), (c)). Illustration of the family $\mathcal{G}_{p,k}$ for $p = 3$ and
$k = 2$; note that there are three distinct graphs $G$ with $p = 3$ vertices and $k = 2$ edges.
Setting the edge parameter $\theta(G) = [\theta \;\; \theta \;\; 0]$ induces a family of three
Markov random fields. As the edge weight parameter $\theta$ increases, the associated
distributions $\mathbb{P}_{\theta(G)}$ become arbitrarily difficult to separate.
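
The effect described in Example 1 can be checked numerically. The Python sketch below computes the
total variation distance between two of the three models for several values of the common edge
weight; total variation is used here purely as a convenient illustration (it is not the distance
used in the analysis of this paper), and the function names and 0-based vertex labels are
assumptions of the sketch.

```python
import itertools
import math


def ising_pmf(p, theta):
    """Brute-force Ising pmf on {-1,+1}^p; theta maps edges (s, t) to weights."""
    configs = list(itertools.product([-1, +1], repeat=p))
    weights = [math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
               for x in configs]
    total = sum(weights)
    return {x: w / total for x, w in zip(configs, weights)}


def total_variation(P, Q):
    """Total variation distance between two pmfs on the same support."""
    return 0.5 * sum(abs(P[x] - Q[x]) for x in P)


def example_tv(weight):
    """Distance between two of the three two-edge models on three vertices."""
    graph_a = {(0, 1): weight, (0, 2): weight}  # pair (1, 2) is the missing edge
    graph_b = {(0, 1): weight, (1, 2): weight}  # pair (0, 2) is the missing edge
    return total_variation(ising_pmf(3, graph_a), ising_pmf(3, graph_b))


for weight in [0.5, 1.0, 2.0, 4.0]:
    print(weight, example_tv(weight))  # the distance shrinks as the weight grows
```

For this pair of models the distance works out to $(1 - e^{-2\theta}) / (4 \cosh^2 \theta)$, which
decays like $e^{-2\theta}$ for large $\theta$, in line with the remark that large edge weights
make the models hard to distinguish.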

Motivated by this example, we define the maximum neighborhood weight

$$\omega^*(\theta(G)) := \max_{s \in V} \sum_{t \in \mathcal{N}(s)} |\theta_{st}|. \tag{5}$$

Our analysis shows that the number of samples $n$ required to distinguish graphs grows
exponentially in this quantity.

In this paper, we study classes of Markov random fields that are parameterized by a lower bound
$\lambda$ on the minimum edge weight, and an upper bound $\omega$ on the maximum neighborhood
weight.

Definition 1 (Classes of graphical models). (a) Given a pair $(\lambda, \omega)$ of positive
numbers, the set $\mathcal{G}_{p,d}(\lambda, \omega)$ consists of all distributions
$\mathbb{P}_{\theta(G)}$ of the form (2) such that (i) the underlying graph $G = (V, E)$ is a
member of the family $\mathcal{G}_{p,d}$ of graphs on $p$ vertices with vertex degree at most $d$;
(ii) the parameter vector $\theta = \theta(G)$ respects the structure of $G$, meaning that
$\theta_{st} \ne 0$ only when $(s,t) \in E$; and (iii) the minimum edge weight and maximum
neighborhood weight satisfy the bounds

$$\lambda^*(\theta(G)) \ge \lambda, \quad \text{and} \quad \omega^*(\theta(G)) \le \omega. \tag{6}$$

(b) The set $\mathcal{G}_{p,k}(\lambda, \omega)$ is defined in an analogous manner, with the graph
$G$ belonging to the class $\mathcal{G}_{p,k}$ of graphs with $p$ vertices and at most $k$ edges.

We note that for any parameter vector $\theta(G)$, we always have the inequality

$$\omega^*(\theta(G)) \ge \max_{s \in V} |\mathcal{N}(s)| \, \lambda^*(\theta(G)), \tag{7}$$

so that the families $\mathcal{G}_{p,k}(\lambda, \omega)$ and $\mathcal{G}_{p,d}(\lambda, \omega)$
are only well-defined for suitable pairs $(\lambda, \omega)$.
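
To make the quantities in (4), (5) and (7) concrete, the following Python sketch computes the
minimum edge weight, the maximum neighborhood weight and the maximum degree for a small weighted
graph. The dictionary representation of $\theta(G)$, the helper names and the example weights are
assumptions of this sketch, and every pair stored in the dictionary is assumed to carry a nonzero
weight.

```python
def min_edge_weight(theta):
    """lambda^*: the smallest |theta_st| over the edges, as in (4)."""
    return min(abs(w) for w in theta.values())


def max_neighborhood_weight(p, theta):
    """omega^*: the largest sum of |theta_st| over the neighbours of a vertex, as in (5)."""
    totals = [0.0] * p
    for (s, t), w in theta.items():
        totals[s] += abs(w)
        totals[t] += abs(w)
    return max(totals)


def max_degree(p, theta):
    """Maximum vertex degree d of the graph whose edges are the keys of theta."""
    deg = [0] * p
    for s, t in theta:
        deg[s] += 1
        deg[t] += 1
    return max(deg)


# Hypothetical 4-vertex example; vertex 0 has three neighbours.
theta = {(0, 1): 0.5, (0, 2): -0.25, (0, 3): 0.75, (2, 3): 0.4}
lam = min_edge_weight(theta)                # 0.25
omega = max_neighborhood_weight(4, theta)   # 0.5 + 0.25 + 0.75 = 1.5 at vertex 0
d = max_degree(4, theta)                    # 3
print(lam, omega, d, omega >= d * lam)      # the last entry checks inequality (7)
```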

2.3 Graph decoders and error criterion 

For a given graph class $\mathcal{G}$ (either $\mathcal{G}_{p,d}$ or $\mathcal{G}_{p,k}$) and
positive weights $(\lambda, \omega)$, suppose that nature chooses some member
$\mathbb{P}_{\theta(G)}$ from the associated family $\mathcal{G}(\lambda, \omega)$ of Markov random
fields. Assume that the statistician observes $n$ samples
$\mathbf{X}_1^n := \{X^{(1)}, \ldots, X^{(n)}\}$ drawn in an independent and identically
distributed (i.i.d.) manner from the distribution $\mathbb{P}_{\theta(G)}$. Note that by
definition of the Markov random field, each sample $X^{(i)}$ belongs to the discrete set
$\mathcal{X} := \{-1, +1\}^p$, so that the overall data set $\mathbf{X}_1^n$ belongs to the
Cartesian product space $\mathcal{X}^n$.

We assume that the goal of the statistician is to use the data $\mathbf{X}_1^n$ to infer the
underlying graph $G \in \mathcal{G}$, which we refer to as the problem of graphical model
selection. More precisely, we consider functions $\phi: \mathcal{X}^n \to \mathcal{G}$, which we
refer to as graph decoders. We measure the quality of a given graph decoder $\phi$ using the 0-1
loss function $\mathbb{I}[\phi(\mathbf{X}_1^n) \ne G]$, which takes value 1 when
$\phi(\mathbf{X}_1^n) \ne G$ and takes the value 0 otherwise, and we define the associated 0-1
risk

$$\mathbb{P}_{\theta(G)}[\phi(\mathbf{X}_1^n) \ne G] = \mathbb{E}_{\theta(G)}\big[\mathbb{I}[\phi(\mathbf{X}_1^n) \ne G]\big],$$

corresponding to the probability of incorrect graph selection. Here the probability (and
expectation) are taken with respect to the product distribution of $\mathbb{P}_{\theta(G)}$ over
the $n$ i.i.d. samples.

The main purpose of this paper is to study the scaling of the sample sizes $n$ (more
specifically, as a function of the graph size $p$, number of edges $k$, maximum degree $d$,
minimum edge weight $\lambda$ and maximum neighborhood weight $\omega$) that are either sufficient
for some graph decoder $\phi$ to output the correct graph with high probability, or conversely,
are necessary for any graph decoder to output the correct graph with probability larger than
1/2.
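
The 0-1 risk can be estimated by simulation for a toy problem. The sketch below draws $n$ i.i.d.
samples from one of the three two-edge models on three vertices and applies a generic
maximum-likelihood decoder with known edge weights. This decoder is an assumption of the sketch,
chosen for simplicity; it is not claimed to be the decoder analyzed later in the paper. The
function names, the number of trials and the random seed are likewise illustrative.

```python
import itertools
import math
import random


def ising_pmf(p, theta):
    """Brute-force Ising pmf on {-1,+1}^p; theta maps edges (s, t) to weights."""
    configs = list(itertools.product([-1, +1], repeat=p))
    weights = [math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
               for x in configs]
    total = sum(weights)
    return {x: w / total for x, w in zip(configs, weights)}


def ml_decoder(samples, candidate_pmfs):
    """Return the label of the candidate model with the largest log-likelihood."""
    def loglik(pmf):
        return sum(math.log(pmf[x]) for x in samples)
    return max(candidate_pmfs, key=lambda name: loglik(candidate_pmfs[name]))


def estimate_risk(weight, n, trials=200, seed=0):
    """Monte Carlo estimate of the 0-1 risk when the true graph is 'a'."""
    rng = random.Random(seed)
    graphs = {
        "a": {(0, 1): weight, (0, 2): weight},
        "b": {(0, 1): weight, (1, 2): weight},
        "c": {(0, 2): weight, (1, 2): weight},
    }
    pmfs = {name: ising_pmf(3, theta) for name, theta in graphs.items()}
    configs = list(pmfs["a"])
    probs = [pmfs["a"][x] for x in configs]
    errors = 0
    for _ in range(trials):
        samples = rng.choices(configs, weights=probs, k=n)
        if ml_decoder(samples, pmfs) != "a":
            errors += 1
    return errors / trials


print(estimate_risk(0.5, n=5))    # few samples: the decoder errs noticeably often
print(estimate_risk(0.5, n=200))  # many samples: the estimated risk is close to zero
```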

We study two variants of the graph selection problem, depending on whether the values of the
edge weights $\theta$ are known or unknown. In the known edge weight variant, the task of the
decoder is to distinguish between graphs, where for any candidate graph $G = (V, E)$, the decoder
knows the numerical values of the parameters $\theta(G)$. (Recall that by definition,
$[\theta(G)]_{uv} = 0$ for all $(u,v) \notin E$, so that the additional information being
provided is the values $[\theta(G)]_{st}$ for all $(s,t) \in E$.) In the unknown edge weight
variant, both the graph structure and the numerical values of the edge weights are unknown.
Clearly, the unknown edge weight variant is more difficult than the known edge weight variant.
We prove necessary conditions (lower bounds on sample size) for the known edge weight variant,
which are then also valid for the unknown variant. In terms of sufficiency, we provide separate
sets of conditions for the known and unknown variants.



3 Main results and some consequences 

In this section, we state our main results and then discuss some of their consequences. We begin
with the statement and discussion of necessary conditions in Section 3.1, followed by sufficient
conditions in Section 3.2.



3.1 Necessary conditions 

We begin by stating some necessary conditions on the sample size $n$ that any decoder must
satisfy for recovery over the families $\mathcal{G}_{p,d}$ and $\mathcal{G}_{p,k}$. Recall from
Section 2.2 the definitions of $\lambda$ and $\omega$ used in the theorems to follow.

Theorem 1 (Necessary conditions for $\mathcal{G}_{p,d}$). Consider the family
$\mathcal{G}_{p,d}(\lambda, \omega)$ of Markov random fields for some $\omega > 1$. If the sample
size is upper bounded as

(logp exp(u;/4)dAlog(^-l) 

n < max< — . , , , , 5-r , -log — -, >, (8 

\2Atanh(A)' 128exp(f) ' 8 ^Sd'J' 

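then for any graph decoder $\phi: \mathcal{X}^n \to \mathcal{G}_{p,d}$, whether given known edge
weights or not, the worst-case probability of error over the family is at least 1/2.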

Remarks: Let us make some comments regarding the interpretation and consequences of
Theorem 1. First, suppose that both the maximum degree $d$ and the minimum edge weight $\lambda$
remain bounded (i.e., do not increase with the problem sequences). In this case, the necessary
conditions (8) can be summarized more compactly as requiring that for some constant $c$, a
sample size $n > c \frac{\log p}{\lambda \tanh(\lambda)}$ is required for bounded degree graphs.
The observation of $\log p$ scaling has also been made in independent work [6], although the
dependence on the signal-to-noise ratio $\lambda$ given here is more refined. Indeed, note that
if the minimum edge weight decreases to zero as the sample size increases, then since
$\lambda \tanh(\lambda) = O(\lambda^2)$ for $\lambda \to 0$, we conclude that a sample size
$n > c' \frac{\log p}{\lambda^2}$ is required, for some constant $c'$.
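
The small-$\lambda$ behavior used in these remarks is easy to check numerically. The sketch below
evaluates the quantity $\log p / (2 \lambda \tanh(\lambda))$, which is how we read the first term
inside the maximum in (8), and compares it with the approximation $\log p / (2 \lambda^2)$. The
function name, the choice $p = 1000$ and the grid of $\lambda$ values are assumptions of this
sketch.

```python
import math


def lower_bound_first_term(p, lam):
    """log p / (2 * lam * tanh(lam)), read here as the first term in the max of (8)."""
    return math.log(p) / (2.0 * lam * math.tanh(lam))


p = 1000
for lam in [1.0, 0.5, 0.1, 0.01]:
    exact = lower_bound_first_term(p, lam)
    approx = math.log(p) / (2.0 * lam ** 2)  # uses tanh(lam) ~ lam for small lam
    print(lam, exact, approx, exact / approx)  # the ratio tends to 1 as lam -> 0
```

Since $\tanh(\lambda) \le \lambda$ for $\lambda \ge 0$, the exact term is never smaller than the
approximation, and the ratio equals $\lambda / \tanh(\lambda)$.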

Some interesting phenomena arise in the case of growing maximum degree $d$. Observe that in
the family $\mathcal{G}_{p,d}(\lambda, \omega)$, we necessarily have $\omega \ge \lambda d$.
Therefore, in the case of growing maximum degree $d \to +\infty$, if we wish the bound (8) not to
grow exponentially in $d$, it is necessary to impose the constraint $\lambda = O(1/d)$. But as
observed previously, since $\lambda \tanh(\lambda) = O(\lambda^2)$ as $\lambda \to 0$, we obtain
the following corollary of Theorem 1:

Corollary 1. For the family $\mathcal{G}_{p,d}(\lambda, \omega)$ with increasing maximum degree
$d$, there is a constant $c > 0$ such that in a worst case sense, any method requires at least
$n > c \max\{d^2, \lambda^{-2}\} \log p$ samples to recover the correct graph with probability at
least 1/2.
We note that Ravikumar et al. [22] have shown that under certain incoherence assumptions
(roughly speaking, control on the Fisher information matrix of the distributions
$\mathbb{P}_{\theta(G)}$) and assuming that $\lambda = \Omega(d^{-1})$, a computationally tractable
method using $\ell_1$-regularization can recover graphs over the family $\mathcal{G}_{p,d}$ using
$n > c' d^3 \log p$ samples, for some constant $c'$; consequently, Corollary 1 shows concretely
that this scaling is within a factor $d$ of the information-theoretic bound.

We now turn to some analogous necessary conditions over the family $\mathcal{G}_{p,k}$ of graphs
on $p$ vertices with at most $k$ edges.

Theorem 2 (Necessary conditions for $\mathcal{G}_{p,k}$). Consider the family
$\mathcal{G}_{p,k}(\lambda, \omega)$ of Markov random fields for some $\omega > 1$. If the sample
size is upper bounded as



then for any graph decoder $\phi: \mathcal{X}^n \to \mathcal{G}_{p,k}$, whether given known edge
weights or not, the worst-case probability of error over the family is at least 1/2.

Remarks: Again, we make some comments about the consequences of Theorem 2. First, suppose
that both the number of edges $k$ and the minimum edge weight $\lambda$ remain bounded (i.e., do
not increase with the problem sequences). In this case, the necessary conditions (10) can be
summarized more compactly as requiring that for some constant $c$, a sample size
$n > c \frac{\log p}{\lambda \tanh(\lambda)}$ is required for graphs with a constant number of
edges. Again, note that if the minimum edge weight decreases to zero as the sample size
increases, then since $\lambda \tanh(\lambda) = O(\lambda^2)$ for $\lambda \to 0$, we conclude
that a sample size $n > c' \frac{\log p}{\lambda^2}$ is required, for some constant $c'$.

The behavior is more subtle in the case of graph sequences in which the number of edges $k$
increases with the sample size. As shown in the proof of Theorem 2, it is possible to construct
a parameter vector $\theta(G)$ over a graph $G$ with $k$ edges such that
$\omega^*(\theta(G)) \ge \lambda \lfloor \sqrt{k} \rfloor$. (More specifically, the construction is
based on forming a completely connected subgraph on $\lfloor \sqrt{k} \rfloor$ vertices, which has
a total of $\binom{\lfloor \sqrt{k} \rfloor}{2} \le k$ edges.) Therefore, if we wish to avoid the
exponential growth from the term $\exp(\omega)$, we require that $\lambda = O(k^{-1/2})$ as the
graph size increases. We thus obtain the following corollary of Theorem 2:

Corollary 2. For the family $\mathcal{G}_{p,k}(\lambda, \omega)$ with increasing number of edges
$k$, there is a constant $c > 0$ such that in a worst case sense, any method requires at least
$n > c \max\{k, \lambda^{-2}\} \log p$ samples to recover the correct graph with probability at
least 1/2.

To clarify a subtle point about comparing Theorems 1 and 2, consider a graph
$G \in \mathcal{G}_{p,d}$, say one with homogeneous degree $d$ at each node. Note that such a graph
has a total of $k = dp/2$ edges. Consequently, one might be misled into thinking Corollary 2
implies that $n > c \frac{pd}{2} \log p$ samples would be required in this case. However, as shown
in our development of sufficient conditions for the class $\mathcal{G}_{p,d}$ (see Theorem 3),
this is not true for sufficiently small degrees $d$.

To understand the difference, it should be remembered that our necessary conditions are
worst-case results, based on adversarial choices from the graph families. As mentioned, the
necessary conditions of Theorem 2 and hence of Corollary 2 are obtained by constructing a graph
$G$ that contains a completely connected graph, $K_{\sqrt{k}}$, with uniform degree $\sqrt{k}$.
But $K_{\sqrt{k}}$ is not a member of $\mathcal{G}_{p,d}$ unless $d \ge \sqrt{k}$. On the other
hand, for the case when $d \ge \sqrt{k}$, the necessary conditions of Corollary 1 amount to
$n > c\,k \log p$ samples being required, which matches the scaling given in Corollary 2.
3.2 Sufficient conditions

We now turn to stating and discussing sufficient conditions (lower bounds on the sample size)
for graph recovery over the families $\mathcal{G}_{p,d}$ and $\mathcal{G}_{p,k}$. These conditions
provide complementary insight to the necessary conditions discussed so far.

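Theorem 3 (Sufficient conditions for $\mathcal{G}_{p,d}$). (a) Suppose that for some
$\delta \in (0, 1)$, the sample size $n$ satisfies

$$n > \frac{3\,[3 \exp(2\omega) + 1]}{\sinh^2(\lambda/2)}\, d\, \Big\{ 3 \log p + \log(2d) + \log\frac{1}{\delta} \Big\}. \tag{12}$$

Then there exists a graph decoder $\phi^*: \mathcal{X}^n \to \mathcal{G}_{p,d}$ such that given
known edge weights, the worst-case error probability satisfies

$$\max_{\theta(G) \in \mathcal{G}_{p,d}(\lambda, \omega)} \mathbb{P}_{\theta(G)}[\phi^*(\mathbf{X}_1^n) \ne G] \le \delta. \tag{13}$$

(b) In the case of unknown edge weights, suppose that the sample size satisfies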



n > 



UJ 



(3exp(2a; + l) 



sinh^(A/4) 
'{l61ogj9 + 41og(2/(5)}. (14)



Then there exists a graph decoder mapping $\mathcal{X}^n$ to $\mathcal{G}_{p,d}$ that has
worst-case error probability at most $\delta$.

Remarks: It is worthwhile comparing the sufficient conditions provided by Theorem 3 to the
necessary conditions from Theorem 1.

First, consider the case of finite degree graphs. In this case, the condition (12) reduces to the
statement that for some constant $c$, it suffices to have $n > c \lambda^{-2} \log p$ samples.
Comparing with the necessary conditions (see the discussion following Theorem 1), we see that for
known edge weights and bounded degrees, the information-theoretic capacity scales as
$\lambda^{-2} \log p$. For unknown edge weights, the conditions (14) provide a weaker guarantee,
namely that n > d logp samples are required; we suspect that this guarantee could be improved by
a more careful analysis.

In the case of growing maximum graph degree $d$, we note that like the necessary conditions (8),
the sample size specified by the sufficient conditions (12) scales exponentially in the
parameter $\omega$. If we wish not to incur such exponential growth, we necessarily must have
that $\lambda = O(1/d)$. We thus obtain the following consequence of Theorem 3:

Corollary 3. For the graph family $\mathcal{G}_{p,d}(\lambda, \omega)$ with increasing maximum
degree, there exists a graph decoder that succeeds with high probability using
$n > c_1 \max\{d^2, \lambda^{-2}\}\, d \log p$ samples.

This corollary follows because the scaling $\lambda = O(1/d)$ implies that $\lambda \to 0$ as $d$
increases, and $\sinh(\lambda/2) = O(\lambda)$ as $\lambda \to 0$. Note that in this regime,
Corollary 1 of Theorem 1 showed that no method has error probability below 1/2 if
$n < c_2 \max\{d^2, \lambda^{-2}\} \log p$, for some constant $c_2$. Therefore, together
Theorems 1 and 3 provide upper and lower bounds on the sample complexity of graph selection that
are matching to within a factor of $d$. We note that under the condition
$\lambda = \Omega(1/d)$, the results of Ravikumar et al. [22] also guarantee correct recovery with
high probability for $n > c_3 d^3 \log p$ using $\ell_1$-regularized logistic regression; however,
their method requires additional (somewhat restrictive) incoherence assumptions that are not
imposed here.

Finally, we state sufficient conditions for the class $\mathcal{G}_{p,k}$ in the case of known
edge weights:

Theorem 4 (Sufficient conditions for $\mathcal{G}_{p,k}$). (a) Suppose that for some
$\delta \in (0, 1)$, the sample size $n$ satisfies

$$n > \frac{3 \exp(2\omega) + 1}{\sinh^2(\lambda/2)} \Big\{ (k+1) \log p + \log\frac{1}{\delta} \Big\}. \tag{15}$$

Then for known edge weights, there exists a graph decoder
$\phi^*: \mathcal{X}^n \to \mathcal{G}_{p,k}$ such that

$$\max_{\theta(G) \in \mathcal{G}_{p,k}(\lambda, \omega)} \mathbb{P}_{\theta(G)}[\phi^*(\mathbf{X}_1^n) \ne G] \le \delta. \tag{16}$$

(b) For unknown edge weights, there also exists a graph decoder that succeeds under the
condition (14).



Remarks: It is again interesting to compare Theorem 4 with the necessary conditions from
Theorem 2. To begin, let the number of edges $k$ remain bounded. In this case, for
$\lambda = o(1)$, condition (15) states that for some constant $c$, it suffices to have
$n > c \frac{\log p}{\lambda^2}$ samples, which matches (up to constant factors) the lower bound
implied by Theorem 2. In the more general setting of $k \to +\infty$, we begin by noting that, as
in Theorem 2, the sample size in Theorem 4 grows exponentially unless the parameter $\omega$ stays
controlled. As with the discussion following Theorem 2, one interesting scaling is to require
that $\lambda \asymp k^{-1/2}$, a choice which controls the worst-case construction that leads to
the factor $\exp(\omega)$ in the proof of Theorem 2. With this scaling, we have the following
consequence:

Corollary 4. Suppose that the minimum value $\lambda$ scales with the number of edges $k$ as
$\lambda \asymp k^{-1/2}$. Then in the case of known edge weights, there exists a decoder that
succeeds with high probability using $n > c\,k^2 \log p$ samples.

Note that these sufficient conditions are within a factor of $k$ of the necessary conditions
from Corollary 2, which show that unless $n > c' \max\{k, \lambda^{-2}\} \log p$, any graph
estimator fails at least half of the time.



4 Proofs of necessary conditions 

In the following two sections, we provide the proofs of our main theorems. We begin by
introducing some background on distances between distributions, as well as some results on the
cardinalities of our model classes. We then provide proofs of the necessary conditions
(Theorems 1 and 2) in this section, followed by the proofs of the sufficient conditions stated in
Theorems 3 and 4 in Section 5.



4.1 Preliminaries 

We begin with some preliminary definitions and results concerning "distance" measures between 
different models, and some estimates of the cardinalities of different model classes. 



4.1.1 Distance measures 

In order to quantify the distinguishability of different models, we begin by defining some useful
"distance" measures. Given two parameters $\theta$ and $\theta'$ in $\mathbb{R}^{\binom{p}{2}}$, we
let $D(\theta \,\|\, \theta')$ denote the Kullback-Leibler divergence [10] between the two
distributions $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$. For the special case of the Ising
model distributions (2), this Kullback-Leibler divergence takes the form

$$D(\theta \,\|\, \theta') := \sum_{x \in \{-1,+1\}^p} \mathbb{P}_\theta[x] \log \frac{\mathbb{P}_\theta[x]}{\mathbb{P}_{\theta'}[x]}. \tag{17}$$

Note that the Kullback-Leibler divergence is not symmetric in its arguments (i.e.,
$D(\theta \,\|\, \theta') \ne D(\theta' \,\|\, \theta)$ in general).

Our analysis also makes use of two other closely related divergence measures, both of which are
symmetric. First, we define the symmetrized Kullback-Leibler divergence, defined in the natural
way via

$$S(\theta \,\|\, \theta') := D(\theta \,\|\, \theta') + D(\theta' \,\|\, \theta). \tag{18}$$

Secondly, given two parameter vectors $\theta$ and $\theta'$, we may consider the model specified
by their average $\frac{\theta + \theta'}{2}$. In terms of this averaged model, we define another
type of divergence via

$$J(\theta \,\|\, \theta') := D\Big(\frac{\theta + \theta'}{2} \,\Big\|\, \theta\Big) + D\Big(\frac{\theta + \theta'}{2} \,\Big\|\, \theta'\Big). \tag{19}$$

Note that this divergence is also symmetric in its arguments. A straightforward calculation
shows that this divergence measure can be expressed in terms of the cumulant function (3)
associated with the Ising family as

$$J(\theta \,\|\, \theta') = Z(\theta) + Z(\theta') - 2\, Z\Big(\frac{\theta + \theta'}{2}\Big). \tag{20}$$

Useful in our analysis are representations of these distance measures in terms of the vector of
mean parameters $\mu(\theta) \in \mathbb{R}^{\binom{p}{2}}$, where element $\mu_{st}$ is given by

$$\mu_{st} := \mathbb{E}_\theta[X_s X_t] = \sum_{x \in \{-1,+1\}^p} \mathbb{P}_\theta[x]\, x_s x_t. \tag{21}$$

It is well-known from the theory of exponential families [Tj [28] that there is a bijection
between the canonical parameters $\theta$ and the mean parameters $\mu$.

Using this notation, a straightforward calculation shows that the symmetrized Kullback-Leibler
divergence between $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is equal to

$$S(\theta \,\|\, \theta') = \sum_{s < t} (\theta_{st} - \theta'_{st})(\mu_{st} - \mu'_{st}), \tag{22}$$

where $\mu_{st}$ and $\mu'_{st}$ denote the edge-based mean parameters under $\theta$ and $\theta'$
respectively.
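
The identity (22) can be verified numerically on a small example. The Python sketch below
computes the symmetrized divergence (18) directly from the two probability mass functions and
again through the mean parameters (21). The helper names, the 0-based vertex labels and the two
four-vertex parameter vectors are assumptions of this sketch.

```python
import itertools
import math


def ising_pmf(p, theta):
    """Brute-force Ising pmf on {-1,+1}^p; theta maps edges (s, t) to weights."""
    configs = list(itertools.product([-1, +1], repeat=p))
    weights = [math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
               for x in configs]
    total = sum(weights)
    return {x: w / total for x, w in zip(configs, weights)}


def kl(P, Q):
    """Kullback-Leibler divergence D(P || Q) between two pmfs with full support."""
    return sum(P[x] * math.log(P[x] / Q[x]) for x in P)


def edge_means(pmf, p):
    """Mean parameters mu_st = E[X_s X_t] for all pairs s < t."""
    return {(s, t): sum(q * x[s] * x[t] for x, q in pmf.items())
            for s in range(p) for t in range(s + 1, p)}


def symmetrized_kl_via_means(p, theta, theta2):
    """Right-hand side of (22): sum over pairs of (theta - theta')(mu - mu')."""
    mu = edge_means(ising_pmf(p, theta), p)
    mu2 = edge_means(ising_pmf(p, theta2), p)
    return sum((theta.get(e, 0.0) - theta2.get(e, 0.0)) * (mu[e] - mu2[e]) for e in mu)


theta = {(0, 1): 0.7, (1, 2): -0.3, (0, 3): 0.5}
theta2 = {(0, 1): 0.2, (2, 3): 0.9}
P, Q = ising_pmf(4, theta), ising_pmf(4, theta2)
direct = kl(P, Q) + kl(Q, P)  # definition (18)
print(direct, symmetrized_kl_via_means(4, theta, theta2))  # the two values agree
```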
4.1.2 Cardinalities of graph classes 

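In addition to these divergence measures, we require some estimates of the cardinalities of the
graph classes $\mathcal{G}_{p,d}$ and $\mathcal{G}_{p,k}$, as summarized in the following:

Lemma 1. (a) For $k \le \binom{p}{2}/2$, the cardinality of $\mathcal{G}_{p,k}$ is bounded as

$$\binom{\binom{p}{2}}{k} \le |\mathcal{G}_{p,k}| \le k \binom{\binom{p}{2}}{k}, \tag{23}$$

and hence $\log |\mathcal{G}_{p,k}| = \Theta\big(k \log \frac{p}{\sqrt{k}}\big)$.

(b) For $d \le \frac{p-1}{2}$, the cardinality of $\mathcal{G}_{p,d}$ is bounded as

$$\Big( \Big\lfloor \frac{p}{d+1} \Big\rfloor ! \Big)^{\frac{d(d+1)}{2}} \le |\mathcal{G}_{p,d}| \le \frac{pd}{2} \binom{\binom{p}{2}}{\frac{pd}{2}}, \tag{24}$$

and hence $\log |\mathcal{G}_{p,d}| = \Theta\big(p d \log \frac{p}{d}\big)$.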
Proof. (a) For the bounds (23) on $|\mathcal{G}_{p,k}|$, we observe that there are
$\binom{\binom{p}{2}}{\ell}$ graphs with exactly $\ell$ edges, and that for
$k \le \binom{p}{2}/2$, we have $\binom{\binom{p}{2}}{\ell} \le \binom{\binom{p}{2}}{k}$ for all
$\ell = 1, 2, \ldots, k$.
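
The counting argument in part (a) can be checked by brute force for small $p$. The sketch below
enumerates all graphs on $p$ labelled vertices with at most $k$ edges and compares the count with
binomial bounds. Because the enumeration also includes the empty graph, the upper bound checked
here uses the slightly looser factor $k + 1$; that choice, the function name and the values
$p = 5$, $k = 3$ are assumptions of the sketch.

```python
import itertools
import math


def count_graphs_at_most_k_edges(p, k):
    """Brute-force count of graphs on p labelled vertices with at most k edges."""
    pairs = list(itertools.combinations(range(p), 2))
    return sum(1 for r in range(k + 1) for _ in itertools.combinations(pairs, r))


p, k = 5, 3
N = math.comb(p, 2)          # number of vertex pairs, here 10
assert k <= N / 2            # the regime assumed in part (a)
count = count_graphs_at_most_k_edges(p, k)

# There are C(N, r) graphs with exactly r edges, so the count is a sum of binomials.
assert count == sum(math.comb(N, r) for r in range(k + 1))
# Each term is at most C(N, k) when k <= N / 2, which gives the sandwich below.
assert math.comb(N, k) <= count <= (k + 1) * math.comb(N, k)
print(count, math.comb(N, k), (k + 1) * math.comb(N, k))
```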

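(b) Turning to the bounds on $|\mathcal{G}_{p,d}|$, observe that every model in
$\mathcal{G}_{p,d}$ has at most $\frac{pd}{2}$ edges. Note that $d \le \frac{p-1}{2}$ ensures that

$$\frac{pd}{2} \le \binom{p}{2} \Big/ 2.$$

Therefore, following the argument in part (a), we conclude that
$|\mathcal{G}_{p,d}| \le \frac{pd}{2} \binom{\binom{p}{2}}{pd/2}$, as claimed.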

In order to establish the lower bound (24), we first group the $p$ vertices into $d+1$ groups of size $\lfloor \frac{p}{d+1} \rfloor$, discarding any remaining vertices. We consider a subset of $\mathcal{G}_{p,d}$: graphs with maximum degree $d$ having the property that each edge straddles vertices in two different groups.

To construct one such graph, we pick a permutation of $\lfloor \frac{p}{d+1} \rfloor$ elements, and form a bijection from group 1 to group 2 corresponding to the permutation. Similarly, we form a bijection from group 1 to group 3, and so on up until group $d+1$. Note that we use $d$ permutations to complete this procedure, and at the end of this round, every vertex in group 1 has degree $d$, while vertices in all other groups have degree 1.

Similarly, in the next round, we use $d-1$ permutations to connect group 2 to groups 3 through $d+1$. In general, for $i = 1, \ldots, d$, in round $i$, we use $d+1-i$ permutations to connect group $i$ with groups $i+1, \ldots, d+1$. Each choice of these permutations yields a distinct graph in $\mathcal{G}_{p,d}$. Note that we use a total of

$$\sum_{i=1}^{d} (d+1-i) = \frac{d(d+1)}{2}$$

permutations over $\lfloor \frac{p}{d+1} \rfloor$ elements, from which the stated claim (24) follows. □

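
The permutation-based construction in the proof can be enumerated explicitly for small cases. The sketch below is an illustration under assumed conventions (vertices are numbered so that group $i$ occupies indices $i g, \ldots, i g + g - 1$ with $g = \lfloor p/(d+1) \rfloor$); the function names are hypothetical.

```python
import itertools
import math


def build_graph(perms, g):
    """Edge set obtained from one permutation per pair of groups.

    perms maps a group pair (i, j), i < j, to a permutation of range(g);
    the a-th vertex of group i is joined to vertex perm[a] of group j.
    """
    edges = set()
    for (i, j), perm in perms.items():
        for a in range(g):
            edges.add((i * g + a, j * g + perm[a]))
    return frozenset(edges)


def enumerate_construction(p, d):
    """All graphs produced by the construction (leftover vertices are unused)."""
    g = p // (d + 1)
    pairs = list(itertools.combinations(range(d + 1), 2))
    all_perms = list(itertools.permutations(range(g)))
    graphs = set()
    for choice in itertools.product(all_perms, repeat=len(pairs)):
        graphs.add(build_graph(dict(zip(pairs, choice)), g))
    return graphs


def max_degree(edges):
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return max(degree.values(), default=0)
```

Because the edges between two groups determine the permutation that generated them, distinct choices give distinct graphs, so the number of graphs equals the factorial of the group size raised to the power $d(d+1)/2$.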
4.1.3 Fano's lemma and variants 

We provide some background on Fano's lemma and its variants needed in our arguments. Consider a family of $M$ models indexed by the parameter vectors $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(M)}\}$. Suppose that we choose a model index $k$ uniformly at random from $\{1, \ldots, M\}$, and then sample a data set $X_1^n$ of $n$ samples drawn in an i.i.d. manner according to the distribution $\mathbb{P}_{\theta^{(k)}}$. In this setting, Fano's lemma provides a lower bound on the probability of error of any classification function $\phi : \mathcal{X}^n \to \{1, \ldots, M\}$, specified in terms of the mutual information

$$I(X_1^n; K) = H(X_1^n) - H(X_1^n \mid K) \qquad (25)$$

between the data $X_1^n$ and the random model index $K$. We say that a decoder $\phi : \mathcal{X}^n \to \{1, \ldots, M\}$ is unreliable over the family if

$$\max_{k = 1, \ldots, M} \mathbb{P}_{\theta^{(k)}}\big[\phi(X_1^n) \ne k\big] \;\ge\; \frac{1}{2}. \qquad (26)$$

We summarize Fano's inequality and a variant thereof in the following lemma: 

Lemma 2. Any of the following upper bounds on the sample size imply that any decoder $\phi$ is unreliable over the family $\{\theta^{(1)}, \ldots, \theta^{(M)}\}$:



(a) The sample size n is upper bounded as 






(b) The sample size n is upper bounded as



s(9<') II SO) 

k=ie=k+i 

These variants of Fano's inequality are standard and widely-used in the non-parametric statistics 
literature (e.g., |15 1 ll7 1 [3 H l30j): see Cover and Thomas fLO\ for a statement and proof of the original 
Fano's inequality. 

4.2 A key separation result 

In order to exploit the condition (28), one needs to construct families of models with relatively large cardinality ($M$ large) such that the models are all relatively close in symmetrized Kullback-Leibler (KL) divergence. Recalling the definition (21) of the mean parameters and the form of the symmetrized KL divergence (22), we see that control of the divergence between $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ can be achieved by ensuring that their respective mean parameters $\mu_{st}$ and $\mu'_{st}$ stay relatively close for all edges $(s,t)$ where the models differ.

In this section, we state and prove a key technical lemma that allows us to control the mean parameters of a certain carefully constructed class of models. As shown in the proofs of Theorems 1 and 2 to follow, this lemma allows us to gain good control on the symmetrized Kullback-Leibler divergences between pairs of models. Our construction, which applies to any integer $m \ge 2$, is based on the following procedure. We begin with the complete graph on $m$ vertices, denoted by $K_m$. We then form a set of $\binom{m}{2}$ graphs, each of which is a subgraph of $K_m$, by removing a particular edge. Denoting by $G^{st}$ the subgraph with edge $(s,t)$ removed, we define the Ising model distribution $\mathbb{P}_{\theta(G^{st})}$ by setting $[\theta(G^{st})]_{uv} = \lambda$ for all edges $(u,v)$ of $G^{st}$, and $[\theta(G^{st})]_{st} = 0$.

The following lemma shows that the mean parameter $\mu_{st} = \mathbb{E}_{\theta(G^{st})}[X_s X_t]$ approaches its maximum value 1 exponentially quickly in the parameter $\omega = \lambda m$.



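Lemma 1. Suppose that $\omega = \lambda m \ge 2$. Then the likelihood ratio on edge $(s,t)$ is lower bounded as

$$\frac{\mathbb{P}_{\theta(G^{st})}[X_s X_t = +1]}{\mathbb{P}_{\theta(G^{st})}[X_s X_t = -1]} = \frac{q_{st}}{1 - q_{st}} \;\ge\; \frac{\exp\big(\frac{\omega}{2} - \frac{3}{2}\lambda\big)}{m+1}, \qquad (29)$$

and moreover, the mean parameter over the pair $(s,t)$ is lower bounded as

$$\mathbb{E}_{\theta(G^{st})}[X_s X_t] \;\ge\; 1 - \frac{2(m+1)\exp\big(\frac{3\lambda}{2}\big)}{\exp\big(\frac{\omega}{2}\big) + (m+1)\exp\big(\frac{3\lambda}{2}\big)}. \qquad (30)$$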
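
As a sanity check on the two bounds in this lemma, one can compute the likelihood ratio exactly for small $m$ by enumerating all configurations of the complete graph with one edge removed. This is only an illustrative sketch: the removed edge is taken to be $(0,1)$ by symmetry, and the function names are assumptions introduced here.

```python
import itertools
import math


def likelihood_ratio(m, lam):
    """Exact P[X_s X_t = +1] / P[X_s X_t = -1] on K_m with edge (0, 1) removed."""
    plus = 0.0
    minus = 0.0
    for x in itertools.product((-1, 1), repeat=m):
        total = sum(x)
        # Sum over all pairs u < v of x_u x_v equals (total^2 - m) / 2;
        # subtracting x_0 x_1 removes the contribution of the deleted edge.
        energy = lam * ((total * total - m) / 2 - x[0] * x[1])
        if x[0] * x[1] == 1:
            plus += math.exp(energy)
        else:
            minus += math.exp(energy)
    return plus / minus


def ratio_lower_bound(m, lam):
    """Right-hand side of the likelihood-ratio bound, with omega = lam * m."""
    return math.exp(lam * m / 2 - 1.5 * lam) / (m + 1)


def mean_lower_bound(m, lam):
    """Right-hand side of the mean-parameter bound, with omega = lam * m."""
    c = (m + 1) * math.exp(1.5 * lam)
    return 1 - 2 * c / (math.exp(lam * m / 2) + c)
```

Since $\mathbb{E}[X_s X_t] = 2 q_{st} - 1$, the exact mean parameter can be recovered from the ratio $r$ as $(r-1)/(r+1)$ and compared against `mean_lower_bound`.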

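Proof. Let us introduce the convenient shorthand $q_{st} = \mathbb{P}_{\theta(G^{st})}[X_s X_t = 1]$. We begin by observing that the bound (29) implies the bound (30). Indeed, suppose that equation (29) holds, or equivalently that $q_{st} \ge \frac{b}{1+b}$, where $b = \frac{\exp(\frac{\omega}{2} - \frac{3}{2}\lambda)}{m+1}$. Observing that $\mathbb{E}_{\theta(G^{st})}[X_s X_t] = 2 q_{st} - 1$, we see that $q_{st} \ge \frac{b}{1+b}$ implies that

$$\mathbb{E}_{\theta(G^{st})}[X_s X_t] \;\ge\; \frac{2b}{1+b} - 1 \;=\; 1 - \frac{2}{1+b},$$

from which equation (30) follows.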
The remainder of our proof is devoted to proving the lower bound (29). Some calculation shows that

1st Er=o(T)-P(i[(2^--- + l)'-4])^ ^3^^ 



1-^^* Er=o(7)«^p(M(2j-"^)^]) 



We lower bound the ratio (31) by choosing one of the largest terms in the denominator. It can be shown that for $\lambda m \ge 2$, the largest terms always lie in the ranges $j \ge 3m/4$ or $j \le m/4$. Accordingly, we may choose a maximizing point $j^* \ge 3m/4$. Since all the terms in the numerator are non-negative, we have

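$$\frac{\mathbb{P}_{\theta(G^{st})}[X_s X_t = +1]}{\mathbb{P}_{\theta(G^{st})}[X_s X_t = -1]} \;\ge\; \frac{\binom{m}{j^*} \exp\big(\frac{\lambda}{2}\big[(2j^* - m + 1)^2 - 4\big]\big)}{(m+1)\binom{m}{j^*} \exp\big(\frac{\lambda}{2}(2j^* - m)^2\big)}$$

$$= \frac{\exp\big(\frac{\lambda}{2}[4j^* - 2m - 3]\big)}{m+1} \;\ge\; \frac{\exp\big(\frac{\lambda}{2}[m - 3]\big)}{m+1} \;=\; \frac{\exp\big(\frac{\omega}{2} - \frac{3}{2}\lambda\big)}{m+1},$$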



which completes the proof of the bound (29). □

4.3 Proof of Theorem 1




We begin with necessary conditions for the bounded degree family $\mathcal{G}_{p,d}$. The proof is based on applying Fano's inequality to three ensembles of graphical models, each contained within the family $\mathcal{G}_{p,d}(\lambda, \omega)$.

Ensemble A: In this ensemble, we consider the set of $\binom{p}{2}$ graphs, each of which contains a single edge. For each such graph (say the one containing edge $(s,t)$, which we denote by $H_{st}$) we set $[\theta(H_{st})]_{st} = \lambda$, and all other entries equal to zero. Clearly, the resulting Markov random fields $\mathbb{P}_{\theta(H)}$ all belong to the family $\mathcal{G}_{p,d}(\lambda, \omega)$. (Note that by definition, we must have $\omega \ge \lambda$ for the family to be non-empty.)

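Let us compute the symmetrized Kullback-Leibler divergence between the MRFs indexed by $\theta(H_{st})$ and $\theta(H_{uv})$. Using the representation (22), we have

$$S(\theta(H_{st}) \,\|\, \theta(H_{uv})) = 2\lambda\, \mathbb{E}_{\theta(H_{st})}[X_s X_t],$$

since $\mathbb{E}_{\theta(H_{st})}[X_u X_v] = 0$ for all $(u,v) \ne (s,t)$, and $\mathbb{E}_{\theta(H_{uv})}[X_u X_v] = \mathbb{E}_{\theta(H_{st})}[X_s X_t]$. Finally, by definition of the distribution $\mathbb{P}_{\theta(H_{st})}$, we have

$$\mathbb{E}_{\theta(H_{st})}[X_s X_t] = \frac{\exp(\lambda) - \exp(-\lambda)}{\exp(\lambda) + \exp(-\lambda)} = \tanh(\lambda),$$

so that we conclude that the symmetrized Kullback-Leibler divergence is equal to $2\lambda \tanh(\lambda)$ for each pair.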
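
The closed form $2\lambda\tanh(\lambda)$ for Ensemble A can be confirmed by direct enumeration. The following sketch computes the symmetrized divergence between two single-edge models from the definition of the KL divergence; the helper names are hypothetical and the vertex labels start at 0.

```python
import itertools
import math


def single_edge_distribution(p, edge, lam):
    """Exact Ising distribution on p vertices with one edge of weight lam."""
    s, t = edge
    weights = {
        x: math.exp(lam * x[s] * x[t])
        for x in itertools.product((-1, 1), repeat=p)
    }
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}


def symmetrized_kl(a, b):
    """D(a || b) + D(b || a), written as a single sum."""
    return sum((a[x] - b[x]) * math.log(a[x] / b[x]) for x in a)


def ensemble_a_divergence(p, edge_one, edge_two, lam):
    return symmetrized_kl(
        single_edge_distribution(p, edge_one, lam),
        single_edge_distribution(p, edge_two, lam),
    )
```

The value does not depend on whether the two edges share a vertex, since the product $X_u X_v$ has mean zero under any single-edge model whose edge differs from $(u,v)$.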
Using the bound (28) from Lemma 2 with $M = \binom{p}{2}$, we conclude that the graph recovery is unreliable (i.e., has error probability above 1/2) if the sample size is upper bounded as

n < (32) 
< Atanh(A)- ^'^^> 

Ensemble B: In order to form this graph ensemble, we begin with a grouping of the $p$ vertices into $\lfloor \frac{p}{d+1} \rfloor$ groups, each with $d+1$ vertices. We then consider the graph $G$ obtained by fully connecting each subset of $d+1$ vertices. More explicitly, $G$ is a graph that contains $\lfloor \frac{p}{d+1} \rfloor$ cliques of size $d+1$. Using this base graph, we form a collection of graphs by beginning with $G$, and then removing a single edge $(u,v)$. We denote the resulting graph by $G^{uv}$. Note that if $p \ge 2(d+1)$, then we can form

$$\Big\lfloor \frac{p}{d+1} \Big\rfloor \binom{d+1}{2} \;\ge\; \frac{dp}{4}$$

such graphs. For each graph $G^{uv}$, we form an associated Markov random field $\mathbb{P}_{\theta(G^{uv})}$ by setting $[\theta(G^{uv})]_{ab} = \lambda > 0$ for all $(a,b)$ in the edge set of $G^{uv}$, and setting the parameter to zero otherwise.

A central component of the argument is the following bound on the symmetrized Kullback-Leibler divergence between these distributions.

Lemma 2. For all distinct pairs of models $\theta(G^{st}) \ne \theta(G^{uv})$ in ensemble B and for all $\lambda \ge 1/d$, the symmetrized Kullback-Leibler divergence is upper bounded as

$$S(\theta(G^{st}) \,\|\, \theta(G^{uv})) \;\le\; \frac{8 \lambda d \exp\big(\frac{3\lambda}{2}\big)}{\exp\big(\frac{d\lambda}{2}\big)}.$$



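Proof. Note that any pair of distinct parameter vectors $\theta(G^{st}) \ne \theta(G^{uv})$ differ in exactly two edges. Consequently, by the representation (22), and the definition of the parameter vectors,

$$S(\theta(G^{st}) \,\|\, \theta(G^{uv})) = \lambda\big(\mathbb{E}_{\theta(G^{uv})}[X_s X_t] - \mathbb{E}_{\theta(G^{st})}[X_s X_t]\big) + \lambda\big(\mathbb{E}_{\theta(G^{st})}[X_u X_v] - \mathbb{E}_{\theta(G^{uv})}[X_u X_v]\big)$$

$$\le\; \lambda\big(1 - \mathbb{E}_{\theta(G^{st})}[X_s X_t]\big) + \lambda\big(1 - \mathbb{E}_{\theta(G^{uv})}[X_u X_v]\big),$$

where the inequality uses the fact that $\lambda > 0$, and the edge-based mean parameters are upper bounded by 1.

Since the model $\mathbb{P}_{\theta(G^{st})}$ factors as a product of separate distributions over the $\lfloor \frac{p}{d+1} \rfloor$ cliques, we can now apply the separation result (30) from Lemma 1 with $m = d+1$ to conclude that

$$S(\theta(G^{st}) \,\|\, \theta(G^{uv})) \;\le\; 2\lambda\, \frac{2(d+2)\exp\big(\frac{3\lambda}{2}\big)}{\exp\big(\frac{d\lambda}{2}\big) + (d+2)\exp\big(\frac{3\lambda}{2}\big)} \;\le\; \frac{8\lambda d \exp\big(\frac{3\lambda}{2}\big)}{\exp\big(\frac{d\lambda}{2}\big)},$$

as claimed. □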

Using Lemma 2 and applying the bound (28) from Lemma 2 with $M = \frac{dp}{4}$ yields that for probability of error below 1/2 and $\lambda d \ge 2$, we require at least

$$\frac{\log\big(\frac{pd}{4} - 1\big)}{S(\theta(G^{st}) \,\|\, \theta(G^{uv}))} \;\ge\; \frac{\exp\big(\frac{d\lambda}{2}\big)\log\big(\frac{pd}{4} - 1\big)}{8 d \lambda \exp\big(\frac{3\lambda}{2}\big)}$$

samples. Since $\exp(t/4) \ge t^2/16$ for all $t \ge 1$, we certainly need at least

$$n \;>\; \frac{\exp(d\lambda/4)\, d\lambda \log\big(\frac{pd}{4} - 1\big)}{128 \exp\big(\frac{3\lambda}{2}\big)}$$

samples. Since $\omega = d\lambda$ in this construction, we conclude that

$$n \;>\; \frac{\exp(\omega/4)\, d\lambda \log\big(\frac{pd}{4} - 1\big)}{128 \exp\big(\frac{3\lambda}{2}\big)}$$

samples are required, as claimed in Theorem 1.

Ensemble C: Finally, we prove the third component in the bound (8). In this case, we consider the ensemble consisting of all graphs in $\mathcal{G}_{p,d}$. From Lemma 1(b), we have

1 ir I ^ d{d + l) p 
loglGp^dl > ^ logL- 



2 °^d+V 

2 a + 1 e 

.dp p 
> — log — . 

For this ensemble, it suffices to use a trivial upper bound on the mutual information (25), namely

$$I(X_1^n; G) \;\le\; H(X_1^n) \;\le\; np,$$

where the second bound follows since $X_1^n$ is a collection of $np$ binary variables, each with entropy at most 1. Therefore, from the Fano bound (27), we conclude that the error probability stays above 1/2 if the sample size $n$ is upper bounded as $n < \frac{d}{8} \log \frac{p}{8d}$, as claimed.

4.4 Proof of Theorem [2] 

We now turn to the proof of necessary conditions for the graph family $\mathcal{G}_{p,k}$ with at most $k$ edges. As with the proof of Theorem 1, it is based on applying Fano's inequality to three ensembles of Markov random fields contained in $\mathcal{G}_{p,k}(\lambda, \omega)$.

Ensemble A: Note that the ensemble (A) previously constructed in the proof of Theorem 1 is also valid for the family $\mathcal{G}_{p,k}(\lambda, \omega)$, and hence the bound (32) is also valid for this family.



Ensemble B: For this ensemble, we choose the largest integer $m$ such that $k + 1 \ge \binom{m}{2}$. Note that we certainly have



We then form a family of $\binom{m}{2}$ graphs as follows: (a) first form the complete graph $K_m$ on a subset of $m$ vertices, and (b) for each $(s,t) \in K_m$, form the graph $G^{st}$ by removing edge $(s,t)$ from $K_m$. We form Markov random fields on these graphs by setting $[\theta(G^{st})]_{wz} = \lambda$ if $(w,z) \in E(G^{st})$, and setting it to zero otherwise.



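Lemma 3. For all distinct model pairs $\theta(G^{st})$ and $\theta(G^{uv})$, we have

$$S(\theta(G^{st}) \,\|\, \theta(G^{uv})) \;\le\; \frac{16\, \omega \exp\big(\frac{5\lambda}{2}\big) \sinh(\lambda)}{\exp\big(\frac{\omega}{2}\big)}. \qquad (33)$$
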
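Proof. We begin by claiming that for any pair $(s,t) \ne (u,v)$, the distribution $\mathbb{P}_{\theta(G^{uv})}$ (i.e., corresponding to the subgraph that does not contain edge $(u,v)$) satisfies

$$\frac{\mathbb{P}_{\theta(G^{uv})}[X_s X_t = +1]}{\mathbb{P}_{\theta(G^{uv})}[X_s X_t = -1]} \;\le\; \exp(2\lambda)\, \frac{\mathbb{P}_{\theta(G^{st})}[X_s X_t = +1]}{\mathbb{P}_{\theta(G^{st})}[X_s X_t = -1]} \;=\; \exp(2\lambda)\, \frac{q_{st}}{1 - q_{st}}, \qquad (34)$$

where we have re-introduced the convenient shorthand $q_{st} = \mathbb{P}_{\theta(G^{st})}[X_s X_t = 1]$ from Lemma 1.

To prove this claim, let $\mathbb{P}_\theta$ be the distribution that contains all edges in the complete subgraph $K_m$, each with weight $\lambda$. Let $Z(\theta)$ and $Z(\theta(G^{st}))$ be the normalization constants associated with $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta(G^{st})}$ respectively. Now since $\lambda > 0$ by assumption, by the FKG inequality [2], we have

$$\frac{\mathbb{P}_{\theta(G^{uv})}[X_s X_t = +1]}{\mathbb{P}_{\theta(G^{uv})}[X_s X_t = -1]} \;\le\; \frac{\mathbb{P}_\theta[X_s X_t = +1]}{\mathbb{P}_\theta[X_s X_t = -1]}.$$

We now apply the definition of $\mathbb{P}_\theta$ and expand the right-hand side of this expression, recalling the fact that the model $\mathbb{P}_{\theta(G^{st})}$ does not contain the edge $(s,t)$. Thus we obtain

$$\frac{\mathbb{P}_\theta[X_s X_t = +1]}{\mathbb{P}_\theta[X_s X_t = -1]} = \frac{\exp(\lambda)\, \frac{Z(\theta(G^{st}))}{Z(\theta)}\, \mathbb{P}_{\theta(G^{st})}[X_s X_t = +1]}{\exp(-\lambda)\, \frac{Z(\theta(G^{st}))}{Z(\theta)}\, \mathbb{P}_{\theta(G^{st})}[X_s X_t = -1]} = \exp(2\lambda)\, \frac{\mathbb{P}_{\theta(G^{st})}[X_s X_t = +1]}{\mathbb{P}_{\theta(G^{st})}[X_s X_t = -1]},$$

which establishes the claim (34).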

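Finally, from the representation (22) for the symmetrized Kullback-Leibler divergence and the definition of the models,

$$S(\theta(G^{st}) \,\|\, \theta(G^{uv})) = \lambda\big\{\mathbb{E}_{\theta(G^{st})}[X_u X_v] - \mathbb{E}_{\theta(G^{uv})}[X_u X_v]\big\} + \lambda\big\{\mathbb{E}_{\theta(G^{uv})}[X_s X_t] - \mathbb{E}_{\theta(G^{st})}[X_s X_t]\big\}$$

$$= 2\lambda\big\{\mathbb{E}_{\theta(G^{uv})}[X_s X_t] - \mathbb{E}_{\theta(G^{st})}[X_s X_t]\big\},$$

where we have used the symmetry of the two terms. Continuing on, we observe the decomposition $\mathbb{E}_{\theta(G^{st})}[X_s X_t] = 1 - 2\,\mathbb{P}_{\theta(G^{st})}[X_s X_t = -1]$, and using the analogous decomposition for the other expectation, we obtain

$$S(\theta(G^{st}) \,\|\, \theta(G^{uv})) = 4\lambda\big\{\mathbb{P}_{\theta(G^{st})}[X_s X_t = -1] - \mathbb{P}_{\theta(G^{uv})}[X_s X_t = -1]\big\}$$

$$\overset{(a)}{\le}\; 4\lambda\Big\{\frac{1}{\frac{q_{st}}{1-q_{st}} + 1} - \frac{1}{\exp(2\lambda)\frac{q_{st}}{1-q_{st}} + 1}\Big\}$$

$$= \frac{4\lambda}{\frac{q_{st}}{1-q_{st}}}\Big\{\frac{1}{1 + \frac{1-q_{st}}{q_{st}}} - \frac{1}{\exp(2\lambda) + \frac{1-q_{st}}{q_{st}}}\Big\}$$

$$= \frac{4\lambda\,(\exp(2\lambda) - 1)}{\frac{q_{st}}{1-q_{st}}} \cdot \frac{1}{\big[1 + \frac{1-q_{st}}{q_{st}}\big]\big[\exp(2\lambda) + \frac{1-q_{st}}{q_{st}}\big]},$$

where in obtaining the inequality (a), we have applied the bound (34) and recalled our shorthand notation $q_{st} = \mathbb{P}_{\theta(G^{st})}[X_s X_t = +1]$. Since $\lambda > 0$ and $(1 - q_{st})/q_{st} \ge 0$, both terms in the denominator of the second factor are at least one, so that we conclude that $S(\theta(G^{st}) \,\|\, \theta(G^{uv})) \le \frac{4\lambda(\exp(2\lambda) - 1)}{q_{st}/(1 - q_{st})}$.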
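Finally, applying the lower bound (29) from Lemma 1 on the ratio $q_{st}/(1 - q_{st})$, we obtain that

$$S(\theta(G^{st}) \,\|\, \theta(G^{uv})) \;\le\; \frac{4\lambda\,(\exp(2\lambda) - 1)\,(m+1)}{\exp\big(\frac{\omega}{2} - \frac{3}{2}\lambda\big)} \;\le\; \frac{16\, \omega \exp\big(\frac{5\lambda}{2}\big) \sinh(\lambda)}{\exp\big(\frac{\omega}{2}\big)},$$

where we have used the identity $\exp(2\lambda) - 1 = 2\exp(\lambda)\sinh(\lambda)$ and the fact that $\lambda(m+1) \le 2m\lambda = 2\omega$. □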

By combining Lemma 3 with Lemma 2(b), we conclude that for correctness with probability at least 1/2, the sample size $n$ must be at least

exp(|)logi|) exp(l) log(A:/8) 

n > > 



32a;exp(^) sinh(A) 64a;exp(^) sinh(A) 
as claimed in Theorem 2.



5 Proofs of sufficient conditions 

We now turn to the proofs of the sufficient conditions given in Theorems 3 and 4, respectively, for the classes $\mathcal{G}_{p,d}$ and $\mathcal{G}_{p,k}$. In both cases, our method involves a direct analysis of a maximum likelihood (ML) decoder, which searches exhaustively over all graphs in the given class, and computes the model with highest likelihood. We begin by describing this ML decoder and providing a standard large deviations bound that governs its performance. The remainder of the proof involves more delicate analysis to lower bound the error exponent in the large deviations bound in terms of the minimum edge weight $\lambda$ and other structural properties of the distributions.



5.1 ML decoding and large deviations bound 

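Given a collection $X_1^n = \{X^{(1)}, \ldots, X^{(n)}\}$ of $n$ i.i.d. samples, its (rescaled) likelihood with respect to model $\mathbb{P}_\theta$ is given by

$$\ell_\theta(X_1^n) := \frac{1}{n} \sum_{i=1}^{n} \log \mathbb{P}_\theta[X^{(i)}]. \qquad (35)$$

For a given graph class $\mathcal{G}$ and an associated set of graphical models $\{\theta(G) \mid G \in \mathcal{G}\}$, the maximum likelihood decoder is the mapping $\phi^* : \mathcal{X}^n \to \mathcal{G}$ defined by

$$\phi^*(X_1^n) = \arg\max_{G \in \mathcal{G}} \ell_{\theta(G)}(X_1^n). \qquad (36)$$

(If the maximum is not uniquely achieved, we choose some graph $G$ from the set of models that attains the maximum.)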
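
To make the exhaustive ML decoder concrete, the following sketch implements it for a tiny graph class using exact enumeration of the Ising distribution. It is an illustration only: the class of single-edge graphs, the sampling helper `draw_samples`, and the dictionary representation of parameters are assumptions chosen for brevity, and the approach is feasible only for very small $p$.

```python
import itertools
import math
import random


def log_probabilities(p, theta):
    """Table {x: log P_theta[x]} for a pairwise Ising model on {-1,+1}^p."""
    energies = {
        x: sum(w * x[s] * x[t] for (s, t), w in theta.items())
        for x in itertools.product((-1, 1), repeat=p)
    }
    log_z = math.log(sum(math.exp(e) for e in energies.values()))
    return {x: e - log_z for x, e in energies.items()}


def rescaled_log_likelihood(samples, log_probs):
    """Average log-probability of the samples under one model."""
    return sum(log_probs[x] for x in samples) / len(samples)


def ml_decode(samples, models, p):
    """Return the graph label whose model has the highest rescaled likelihood.

    models maps a graph label to its parameter dictionary theta(G).
    Ties are broken by dictionary order, one admissible choice.
    """
    tables = {g: log_probabilities(p, theta) for g, theta in models.items()}
    return max(tables, key=lambda g: rescaled_log_likelihood(samples, tables[g]))


def draw_samples(p, theta, n, seed):
    """Draw n i.i.d. configurations by sampling from the exact distribution."""
    rng = random.Random(seed)
    log_probs = log_probabilities(p, theta)
    configs = list(log_probs)
    weights = [math.exp(log_probs[x]) for x in configs]
    return rng.choices(configs, weights=weights, k=n)
```

A typical use is to build `models` as a dictionary from candidate edge sets to parameter vectors, draw samples from one of them, and check that `ml_decode` returns the generating graph once $n$ is large enough.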

Suppose that the data is drawn from model $\mathbb{P}_{\theta(G)}$ for some $G \in \mathcal{G}$. Then the ML decoder $\phi^*$ fails only if there exists some other $\theta(G') \ne \theta(G)$ such that $\ell_{\theta(G')}(X_1^n) \ge \ell_{\theta(G)}(X_1^n)$. (Note that we are being conservative by declaring failure when equality holds.) Consequently, by the union bound, we have

$$\mathbb{P}[\phi^*(X_1^n) \ne G] \;\le\; \sum_{G' \in \mathcal{G} \setminus G} \mathbb{P}\big[\ell_{\theta(G')}(X_1^n) \ge \ell_{\theta(G)}(X_1^n)\big].$$

Therefore, in order to provide sufficient conditions for the error probability of the ML decoder to vanish, we need to provide an appropriate large deviations bound.
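Lemma 3. Given $n$ i.i.d. samples $X_1^n = \{X^{(1)}, \ldots, X^{(n)}\}$ from $\mathbb{P}_{\theta(G)}$, for any $G' \ne G$, we have

$$\mathbb{P}\big[\ell_{\theta(G')}(X_1^n) \ge \ell_{\theta(G)}(X_1^n)\big] \;\le\; \exp\Big(-\frac{n}{2}\, J(\theta(G) \,\|\, \theta(G'))\Big), \qquad (37)$$

where the distance $J$ was defined previously (19).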



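Proof. So as to lighten notation, let us write $\theta = \theta(G)$ and $\theta' = \theta(G')$. We apply the Chernoff bound to the random variable $V := \ell_{\theta'}(X_1^n) - \ell_\theta(X_1^n)$, thereby obtaining that

$$\frac{1}{n} \log \mathbb{P}_\theta[V \ge 0] \;\le\; \frac{1}{n} \inf_{s > 0} \log \mathbb{E}_\theta[\exp(sV)]$$

$$= \inf_{s > 0} \log \sum_{x \in \{-1,+1\}^p} \big[\mathbb{P}_\theta[x]\big]^{1-s} \big[\mathbb{P}_{\theta'}[x]\big]^{s}$$

$$\le\; \log Z(\theta/2 + \theta'/2) - \frac{1}{2} \log Z(\theta) - \frac{1}{2} \log Z(\theta'),$$

where the final inequality follows by taking $s = 1/2$, and $Z(\theta)$ denotes the normalization constant associated with the Markov random field $\mathbb{P}_\theta$, as defined in equation (3). The claim then follows by applying the representation (20) of $J(\theta \,\|\, \theta')$.

□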
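
The exponent in this proof can be checked numerically: the divergence $J$ computed from its definition as a sum of two KL divergences should coincide with the expression in terms of normalization constants, and with $-2\log\sum_x \sqrt{\mathbb{P}_\theta[x]\,\mathbb{P}_{\theta'}[x]}$. The sketch below assumes brute-force enumeration and uses hypothetical function names.

```python
import itertools
import math


def energies(p, theta):
    return {
        x: sum(w * x[s] * x[t] for (s, t), w in theta.items())
        for x in itertools.product((-1, 1), repeat=p)
    }


def log_partition(p, theta):
    """log Z(theta) by enumeration."""
    return math.log(sum(math.exp(e) for e in energies(p, theta).values()))


def distribution(p, theta):
    log_z = log_partition(p, theta)
    return {x: math.exp(e - log_z) for x, e in energies(p, theta).items()}


def average(theta, theta_prime):
    keys = set(theta) | set(theta_prime)
    return {e: (theta.get(e, 0.0) + theta_prime.get(e, 0.0)) / 2 for e in keys}


def kl(a, b):
    return sum(pa * math.log(pa / b[x]) for x, pa in a.items())


def j_from_definition(p, theta, theta_prime):
    """D(avg || theta) + D(avg || theta'), with avg the averaged parameter."""
    mid = distribution(p, average(theta, theta_prime))
    return kl(mid, distribution(p, theta)) + kl(mid, distribution(p, theta_prime))


def j_from_partition(p, theta, theta_prime):
    """log Z(theta) + log Z(theta') - 2 log Z(avg)."""
    return (log_partition(p, theta) + log_partition(p, theta_prime)
            - 2 * log_partition(p, average(theta, theta_prime)))


def j_from_overlap(p, theta, theta_prime):
    """-2 log sum_x sqrt(P_theta[x] P_theta'[x]), the s = 1/2 Chernoff exponent."""
    a = distribution(p, theta)
    b = distribution(p, theta_prime)
    return -2 * math.log(sum(math.sqrt(a[x] * b[x]) for x in a))
```

All three quantities should agree up to rounding, which is the content of the last step of the proof.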

5.2 Lower bounds based on matching 

In order to exploit the large deviations claim in Lemma 3, we need to derive lower bounds on the divergence $J(\theta(G) \,\|\, \theta(G'))$ between different models. Intuitively, it is clear that this divergence is related to the discrepancy of the edge sets of the two graphs. The following lemma makes this intuition precise. We first recall some standard graph-theoretic terminology: a matching of a graph $G = (V, E)$ is a subgraph $H$ such that each vertex in $H$ has degree one. The matching number of $G$ is the maximum number of edges in any matching of $G$.
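
For small graphs, the matching number of the symmetric difference of two edge sets can be computed by exhaustive search. The sketch below is a simple illustration of the definition rather than an efficient algorithm; the function names are assumptions introduced here.

```python
def symmetric_difference(edges_a, edges_b):
    """Edges present in exactly one of the two graphs (undirected)."""
    normalize = lambda edges: {tuple(sorted(e)) for e in edges}
    return normalize(edges_a) ^ normalize(edges_b)


def matching_number(edges):
    """Maximum number of pairwise vertex-disjoint edges, by branching."""
    edges = sorted(edges)

    def best_from(index, used):
        if index == len(edges):
            return 0
        u, v = edges[index]
        skip = best_from(index + 1, used)
        if u in used or v in used:
            return skip
        take = 1 + best_from(index + 1, used | {u, v})
        return max(skip, take)

    return best_from(0, frozenset())
```

The search takes exponential time in the number of edges, which is acceptable for illustrating the definition on a handful of vertices.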

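Lemma 4. Given two distinct graphs $G = (V, E)$ and $G' = (V, E')$, let $m(G, G')$ be the matching number of the graph with edge set

$$E \Delta E' := (E \setminus E') \cup (E' \setminus E).$$

Then for any pair of parameter vectors $\theta(G)$ and $\theta(G')$ in $\mathcal{G}(\lambda, \omega)$, we have

$$J(\theta(G) \,\|\, \theta(G')) \;\ge\; \frac{m(G, G')}{3\exp(2\omega) + 1}\, \sinh^2\Big(\frac{\lambda}{4}\Big). \qquad (38)$$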

Proof. Some comments on notation before proceeding: we again adopt the shorthand notation $\theta = \theta(G)$ and $\theta' = \theta(G')$. In this proof, we use $e_j$ to denote either a particular edge, or the set of two vertices that specify the edge, depending on the context. Given any subset $A \subset V$, we use $x_A = \{x_s : s \in A\}$ to denote the collection of variables indexed by $A$.

Given any edge $e = (u, v)$ with $u \notin A$ and $v \notin A$, we define the conditional distribution



over the random variables $x_e = (x_u, x_v)$ indexed by the edge. Finally, we use

2 2 






to denote the divergence (19) applied to the conditional distributions of $(X_u, X_v \mid X_A = x_A)$.

With this notation, let $M \subset E \Delta E'$ be the subset of edges in some maximal matching of the graph with edge set $E \Delta E'$; concretely, let us write $M = \{e_1, \ldots, e_m\}$, and denote by $V \setminus M$ the subset of vertices that are not involved in the matching. Note that since $J$ is a combination of Kullback-Leibler (KL) divergences, the usual chain rule for KL divergences [10] also applies to it. Consequently, we have

$$J(\theta \,\|\, \theta') \;\ge\; \sum_{\ell=1}^{m} \sum_{x_{S_{\ell-1}}} \mathbb{P}_{\frac{\theta+\theta'}{2}}\big(x_{V \setminus M}, x_{e_1}, \ldots, x_{e_{\ell-1}}\big)\, J_{x_{S_{\ell-1}}}(\theta \,\|\, \theta'),$$

where for each $\ell$ we are conditioning on the set of variables $x_{S_{\ell-1}} := (x_{V \setminus M}, x_{e_1}, \ldots, x_{e_{\ell-1}})$. Finally, from Lemma 7 in Appendix A, for all $\ell = 1, \ldots, m$ and all values of $x_{S_{\ell-1}}$, we have

$$J_{x_{S_{\ell-1}}}(\theta \,\|\, \theta') \;\ge\; \frac{1}{3\exp(2\omega) + 1}\, \sinh^2\Big(\frac{\lambda}{4}\Big),$$

from which the claim follows.



□ 



5.3 Proof of Theorem [3](a) 

We first consider distributions belonging to the class $\mathcal{G}_{p,d}(\lambda, \omega)$, where $\lambda$ is the minimum absolute value of any non-zero edge weight, and $\omega$ is the maximum neighborhood weight (5). Consider a pair of graphs $G$ and $G'$ in the class $\mathcal{G}_{p,d}$ that differ in $\ell = |E \Delta E'|$ edges. Since both graphs have maximum degree at most $d$, we necessarily have a matching number $m(G, G') \ge \frac{\ell}{4d}$. Note that the parameter $\ell = |E \Delta E'|$ can range from 1 all the way up to $dp$, since a graph with maximum degree $d$ has at most $\frac{dp}{2}$ edges.

Now consider some fixed graph $G \in \mathcal{G}_{p,d}$ and associated distribution $\mathbb{P}_{\theta(G)} \in \mathcal{G}_{p,d}(\lambda, \omega)$; we upper bound the error probability $\mathbb{P}_{\theta(G)}[\phi^*(X_1^n) \ne G]$. For each $\ell = 1, 2, \ldots, dp$, there are at most $\binom{\binom{p}{2}}{\ell}$ models in $\mathcal{G}_{p,d}$ with mismatch $\ell$ from $G$. Therefore, applying the union bound, the large deviations bound in Lemma 3, and the lower bound in terms of matching from Lemma 4, we obtain

Pe(G)r(X?)/G] < 

< 
< 

This probability is at most $\delta$ under the given conditions on $n$ in the statement of Theorem 3(a).



^V^y 3exp(2w) + l ^2^/ 

pd max exp <^ log \ „ ] — n ; — smh ( — ) > 

e=i,...,pd \ij 3exp(2u;) + l ^2^ 

max exp I logfpd) + ^logn — n ^-j^ — }■ sinh^f — )1. 

e=i,...,pd ' 3exp 2a;) + l ^2^/ 



5.4 Proof of Theorem |4] 

Next we consider the class $\mathcal{G}_{p,k}$ of graphs with at most $k$ edges. Given some fixed graph $G \in \mathcal{G}_{p,k}$, consider some other graph $G' \in \mathcal{G}_{p,k}$ such that the set $E \Delta E'$ has cardinality $m$. We claim that for each $m = 1, 2, \ldots, 2k$, the number of such graphs is at most $2^k p^{2m(k+1)}$.

To verify this claim, recall the notion of a vertex cover of a set of edges, namely a subset of vertices such that each edge in the set is incident to at least one vertex of the set. Note also that the vertices involved in any maximal matching form a vertex cover. Consequently, any maximal matching over the edge set $E \Delta E'$ of cardinality $m$ can be described in the following (suboptimal) fashion: (i) first specify which of the $k$ edges in $E$ are missing in $E'$; (ii) describe which of the at most $2m$ vertices belong to the vertex cover defined by the maximal matching; and (iii) describe the subset of at most $k$ vertices that are connected to it. This procedure yields at most $2^k p^{2m} p^{2mk} = 2^k p^{2m(k+1)}$ possibilities, as claimed.

Consequently, applying the union bound, the large deviations bound in Lemma 3, and the lower bound in terms of matching from Lemma 4, we obtain

m=l 

< k max exp /A: + 2m(A; + 1) log p — n -, — ; sinh^f— )1. 

m=i,...,fc ^ ' 3exp(2u;) + l ^2^ 

This probability is less than $\delta$ under the conditions of Theorem 4, which completes the proof.

5.5 Proof of Theorems [3](b) and [4](b) 

Finally, we prove the sufficient conditions given in Theorems 3(b) and 4(b), which do not assume that the decoder knows the parameter vector $\theta(G)$ for each graph $G \in \mathcal{G}_{p,d}$. In this case, the simple ML decoder (36) cannot be applied, since it assumes knowledge of the model parameters $\theta(G)$ for each graph $G \in \mathcal{G}_{p,d}$. A natural alternative would be the generalized likelihood ratio approach, which would maximize the likelihood over each model class, and then compare the maximized likelihoods. Our proof of Theorem 3(b) is based on minimizing the distance between the empirical and model mean parameters in the $\ell_\infty$ norm, which is easier to analyze.

5.5.1 Decoding from mean parameters 

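We begin by describing the graph decoder used to establish the sufficient conditions of Theorem 3(b). For any parameter vector $\theta \in \mathbb{R}^{\binom{p}{2}}$, let $\mu(\theta) \in \mathbb{R}^{\binom{p}{2}}$ represent the associated set of mean parameters, with element $(s,t)$ given by $[\mu(\theta)]_{st} := \mathbb{E}_\theta[X_s X_t]$. Given a data set $X_1^n = \{X^{(1)}, \ldots, X^{(n)}\}$, the empirical mean parameters are given by

$$\hat{\mu}_{st} := \frac{1}{n} \sum_{i=1}^{n} X_s^{(i)} X_t^{(i)}. \qquad (41)$$

For a given graph $G = (V, E)$, let $\Theta_{\lambda,\omega}(G) \subset \mathbb{R}^{\binom{p}{2}}$ be a subset of exponential parameters that respect the graph structure, viz.

(a) we have $\theta_{uv} = 0$ for all $(u, v) \notin E$;

(b) for all edges $(s,t) \in E$, we have $|\theta_{st}| \ge \lambda$, and

(c) for all vertices $s \in V$, we have $\sum_{t \in N(s)} |\theta_{st}| \le \omega$.

For any graph $G$ and set of mean parameters $\mu \in \mathbb{R}^{\binom{p}{2}}$, we define a projection-type distance via $J_G(\mu) = \min_{\theta \in \Theta_{\lambda,\omega}(G)} \|\mu - \mu(\theta)\|_\infty$.

We now have the necessary ingredients to define a graph decoder $\phi^\dagger : \mathcal{X}^n \to \mathcal{G}_{p,d}$; in particular, it is given by

$$\phi^\dagger(X_1^n) := \arg\min_{G \in \mathcal{G}_{p,d}} J_G(\hat{\mu}), \qquad (42)$$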
where $\hat{\mu}$ are the empirical mean parameters previously defined (41). (If the minimum (42) is not uniquely achieved, then we choose some graph that achieves the minimum.)

5.5.2 Analysis of decoder 

Suppose that the data are sampled from $\mathbb{P}_{\theta(G)}$ for some fixed but unknown graph $G \in \mathcal{G}_{p,d}$, and parameter vector $\theta(G) \in \Theta_{\lambda,\omega}(G)$. Note that the graph decoder $\phi^\dagger$ can fail only if there exists some other graph $G'$ such that the difference $\Delta(G'; G) := J_{G'}(\hat{\mu}) - J_G(\hat{\mu})$ is not positive. (Again, we are conservative in declaring failure if there are ties.)

Let $\theta'$ denote some element of $\Theta_{\lambda,\omega}(G')$ that achieves the minimum defining $J_{G'}(\hat{\mu})$, so that $J_{G'}(\hat{\mu}) = \|\hat{\mu} - \mu(\theta')\|_\infty$. Note that by the definition of $J_G$, we have $J_G(\hat{\mu}) \le \|\hat{\mu} - \mu(\theta(G))\|_\infty$, where $\theta(G)$ are the parameters of the true model. Therefore, by the definition of $\Delta(G'; G)$, we have

$$\Delta(G'; G) \;\ge\; \|\hat{\mu} - \mu(\theta')\|_\infty - \|\hat{\mu} - \mu(\theta(G))\|_\infty$$

$$\ge\; \|\mu(\theta') - \mu(\theta(G))\|_\infty - 2\|\hat{\mu} - \mu(\theta(G))\|_\infty, \qquad (43)$$

where the second inequality applies the triangle inequality.

Therefore, in order to prove that $\Delta(G'; G)$ is positive, it suffices to obtain an upper bound on $\|\hat{\mu} - \mu(\theta(G))\|_\infty$, and a lower bound on $\|\mu(\theta') - \mu(\theta(G))\|_\infty$, where $\theta'$ ranges over $\Theta_{\lambda,\omega}(G')$. With this perspective, let us state two key lemmas. We begin with the deviation between the sample and population mean parameters:

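Lemma 5 (Elementwise deviation). Given $n$ i.i.d. samples drawn from $\mathbb{P}_{\theta(G)}$, the sample mean parameters $\hat{\mu}$ and population mean parameters $\mu(\theta(G))$ satisfy the tail bound

$$\mathbb{P}\big[\|\hat{\mu} - \mu(\theta(G))\|_\infty \ge t\big] \;\le\; 2\exp\Big(-n\frac{t^2}{2} + 2\log p\Big).$$

This probability is less than $\delta$ for $t \ge \sqrt{\frac{4\log p + \log(2/\delta)}{n}}$.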

Our second lemma concerns the separation of the mean parameters of models with different graph structure:

Lemma 6 (Pairwise separations). Consider any two graphs $G = (V, E)$ and $G' = (V, E')$, and an associated set of model parameters $\theta(G) \in \Theta_{\lambda,\omega}(G)$ and $\theta(G') \in \Theta_{\lambda,\omega}(G')$. Then for all edges $(s,t) \in E \setminus E' \cup E' \setminus E$,

$$\max_{u \in \{s,t\},\, v \in V} \big|\mathbb{E}_{\theta(G)}[X_u X_v] - \mathbb{E}_{\theta(G')}[X_u X_v]\big| \;\ge\; \frac{\sinh^2(\lambda/4)}{2\omega\,(3\exp(2\omega) + 1)}.$$

We provide the proofs of these two lemmas in Sections 5.5.3 and 5.5.4 below.

Given these two lemmas, we can complete the proofs of Theorem 3(b) and Theorem 4(b). Using the lower bound (43), with probability greater than $1 - \delta$, we have



A(G';G) > ""^'W^) /41ogp + log(2/^) 

^ ^ ~ u; (3exp(2a; + l) V n 

This quantity is positive as long as 

-u (3exp(2a; + l^l i 2 



n > 



sinh^(A/4) 



{l61ogp + 41og(2/5)}, 



which completes the proof. 

It remains to prove the auxiliary lemmas used in the proof. 
5.5.3 Proof of Lemma 5

This claim is an elementary consequence of the Hoeffding bound. By definition, for each pair $(s,t)$ of distinct vertices, we have

$$\hat{\mu}_{st} - [\mu(\theta(G))]_{st} = \frac{1}{n} \sum_{i=1}^{n} X_s^{(i)} X_t^{(i)} - \mathbb{E}_{\theta(G)}[X_s X_t],$$

which is the deviation of a sample mean from its expectation. Since the random variables $\{X_s^{(i)} X_t^{(i)}\}_{i=1}^{n}$ are i.i.d. and lie in the interval $[-1, +1]$, an application of Hoeffding's inequality [16] yields that

$$\mathbb{P}\big[|\hat{\mu}_{st} - [\mu(\theta(G))]_{st}| \ge t\big] \;\le\; 2\exp(-n t^2/2).$$

The lemma follows by applying the union bound over all $\binom{p}{2}$ edges of the graph, and the fact that $\log\binom{p}{2} \le 2\log p$.
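
The union-bound form of the Hoeffding inequality can be turned into a small calculator. The sketch below evaluates the tail bound $2\exp(-nt^2/2 + 2\log p)$ and solves it for the smallest integer $n$ that drives the bound below a target level; the function names and the example values are assumptions made for illustration.

```python
import math


def elementwise_tail_bound(n, p, t):
    """Upper bound on P[max_st |mu_hat_st - mu_st| >= t]; may exceed 1."""
    return 2.0 * math.exp(-n * t * t / 2.0 + 2.0 * math.log(p))


def samples_for_deviation(p, t, delta):
    """Smallest integer n for which the tail bound is at most delta.

    Obtained by solving 2 exp(-n t^2 / 2 + 2 log p) <= delta for n.
    """
    return math.ceil(2.0 * (2.0 * math.log(p) + math.log(2.0 / delta)) / (t * t))
```

For instance, calling `samples_for_deviation(100, 0.1, 0.05)` returns the sample size at which all pairwise empirical correlations on 100 vertices are within 0.1 of their means with probability at least 0.95, according to this bound.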



5.5.4 Proof of Lemma [6] 

The proof of this claim is more involved. Let $(s,t)$ be an edge in $E \setminus E'$, and let $C$ be the set of all other vertices that are adjacent to $s$ or $t$ in either graph, namely the set

$$C := \{u \in V \mid (u,s) \text{ or } (u,t) \in E \cup E'\} = \big(N(s) \cup N(t)\big) \setminus \{s, t\}.$$

Our approach is to condition on the variables $x_C = \{x_u, u \in C\}$, and consider the two conditional distributions over the pair $(X_s, X_t)$, defined by $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ respectively. In particular, for any subset $S \subseteq V$, let us define the unnormalized distribution

$$Q_\theta(x_S) := \sum_{x_a,\, a \notin S} \exp\Big(\sum_{(u,v) \in E} \theta_{uv} x_u x_v\Big), \qquad (44)$$

obtained by summing out all variables $x_a$ for $a \notin S$. With this notation, we can write the conditional distribution of $(X_s, X_t)$ given $\{X_C = x_C\}$ as

$$\mathbb{P}_{\theta[x_C]}(x_s, x_t) = \frac{Q_\theta(x_s, x_t, x_C)}{Q_\theta(x_C)}. \qquad (45)$$

As reflected in our choice of notation, for each fixed $x_C$, the distribution (39) can be viewed as an Ising model over the pair $(X_s, X_t)$ with exponential parameter $\theta[x_C]$. We define the unnormalized distributions $Q_{\theta'}[x_S]$ and the conditional distributions $\mathbb{P}_{\theta'[x_C]}$ in an analogous manner.

Our approach now is to study the divergence $J(\theta[x_C] \,\|\, \theta'[x_C])$ between the conditional distributions induced by $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$. Using Lemma 7 from Appendix A, for each choice of $x_C$, we have



J{9[xc]\\9'[xc]) > 



[Ji9[Xc]\\9'[Xc])] > 



where the expectation is taken under the model 



e+e' 



— r sinh^f — ), 

3exp(2u;) + l M^' 

Some calculation shows that 



(46) 



Ee+ei[J{9[Xc]\\9'[Xc])] = E 



e+e' 
2 



log' 



eiXc) 



-{Xc) 



+ Ee+e' 
2 



log- 



c) 



,0^ 



{Xc) 
Applying Jensen's inequality yields that



log- 



(Xc) 



< log 



log 



E, 



2 [■ 



XC\ 



E 



with an analogous inequality for the term involving Qg/. Consequently, we have the upper bound 



¥.e+ei[J[e[Xc]\\e'[Xc])\ < log- 



(47) 



which we exploit momentarily. 

In order to use this bound, let us upper bound the quantity 



A(e,e') := ¥.e[D{e[X, 



,e + e' 



c\ 



2 )[Xc])\ +MD{0'[Xc\ II {^)[Xc\)]. 



By the definition of the Kullback-Leibler divergence, we have 

u&N{s)\t veN{t)\s 



+ Eelog , +Eg/log , (48) 



^{Xc) 



'{Xc) 



In this equation, the quantities $\mu$ and $\mu'$ denote mean parameters computed under the distributions $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ respectively. But by Jensen's inequality, we have the upper bound



Qe+eLiXc) 



< log 2 



(49) 



with an analogous upper bound for the term involving 9' . 
Combining the bounds (j^7|) , (08]) and (09]) , we obtain 

E^[j(e[Xd||^'[Xc])] < (/i,„-/x;j(e,,-0+ E 

Finally, since Em6A/'(s) I^«sI ^ by the definition ^ (and similarly for the neighborhood of t), we 
conclude that 

¥.e+e'[J{e[Xc]\\e'[Xc])\ < 2a; max l^u^-^'^J. 

~2~ nejs.tl.fey 



Combining this upper bound with the lower bound (|46|) yields the claim. 



6 Discussion 

In this paper, we have analyzed the information-theoretic limits of binary graphical model selection in a high-dimensional framework, in which the sample size $n$, number of graph vertices $p$, number of edges $k$ and/or the maximum vertex degree $d$ are allowed to tend to infinity. We proved four main results, corresponding to both necessary and sufficient conditions for the class $\mathcal{G}_{p,d}$ of graphs on $p$ vertices with maximum vertex degree $d$, as well as for the class $\mathcal{G}_{p,k}$ of graphs on $p$ vertices with at most $k$ edges. More specifically, for the class $\mathcal{G}_{p,d}$, we showed that any algorithm requires at least $n > c\, d^2 \log p$ samples, and we demonstrated an algorithm that succeeds using $n < c'\, d^3 \log p$ samples. Our two main results for the class $\mathcal{G}_{p,k}$ have a similar flavor: we showed that any algorithm requires at least $n > c\, k \log p$ samples, and we demonstrated an algorithm that succeeds using $n < c'\, k^2 \log p$ samples. Thus, for graphs with constant degree $d$ or a constant number of edges $k$, our bounds provide a characterization of the information-theoretic complexity of binary graphical selection that is tight up to constant factors. For growing degrees or edge numbers, there remains a minor gap in our conditions.

In terms of open questions, one immediate issue is to close the current gap between our necessary and sufficient conditions; as summarized above, these gaps are of order $d$ and $k$ for $\mathcal{G}_{p,d}$ and $\mathcal{G}_{p,k}$ respectively. We note that previous work by Ravikumar et al. [22] has shown that a computationally tractable method, based on $\ell_1$-regularization and logistic regression, can recover binary graphical models using $n = \Omega(d^3 \log p)$ samples. This result is consistent with the theory given here, and it would be interesting to determine whether or not their algorithm, appealing due to its computational tractability, is actually information-theoretically optimal. Moreover, in the current paper, although we have focused exclusively on binary graphical models with pairwise interactions, many of the techniques and results (e.g., constructing "packings" of graph classes, Fano's lemma and variants, large deviations analysis) apply to more general classes of discrete graphical models, and it would be interesting to explore extensions in this direction.

A A separation lemma

In this appendix, we prove the following lemma, which plays a key role in the proofs of both Lemmas 4 and 6. Given an edge $e = (s,t)$ and some subset $U \subseteq V \setminus \{s,t\}$, recall that $J_{x_U}(\theta \,\|\, \theta')$ denotes the divergence (19) applied to the conditional distributions of $(X_s, X_t \mid X_U = x_U)$, as defined explicitly in equation (40).



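Lemma 7. Consider two distinct graphs $G = (V, E)$ and $G' = (V, E')$, with associated parameter vectors $\theta$ and $\theta'$. Given an edge $(s,t) \in E \setminus E'$ and any subset $U \subseteq V \setminus \{s,t\}$, we have

$$J_{x_U}(\theta \,\|\, \theta') \;\ge\; \frac{1}{3\exp(2\omega) + 1}\, \sinh^2\Big(\frac{\lambda}{4}\Big).$$

Proof. To lighten notation, we define $\epsilon := \frac{2}{3\exp(2\omega) + 1} \sinh^2\big(\frac{\lambda}{4}\big) > 0$. Note that from the definition (5), we have $\omega \ge |\theta_{uv}|$, which implies that $\epsilon < 2$. For future reference, we also note the relation

$$\Big[\exp\Big(\frac{\lambda}{4}\Big) - \exp\Big(-\frac{\lambda}{4}\Big)\Big]^2 = 2\epsilon + 6\epsilon \exp(2\omega). \qquad (51)$$

With this set-up, our argument proceeds via proof by contradiction. In particular, we assume that

$$J_{x_U}(\theta \,\|\, \theta') \;<\; \epsilon/2 \qquad (52)$$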
and then derive a contradiction. Recall from equation (44) our notation $Q_\theta(x_A)$ for the unnormalized distribution applied to the subset of variables $x_A = \{x_i, i \in A\}$. With a little bit of algebra, we find that

re, in II ni-, i ^djxu) ^e'jxy) 

Jxu\^\\G) = log^ —■ 



ZS=XU 
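
As a sanity check on this expression, the following minimal Python sketch evaluates the divergence by brute-force enumeration on a tiny model. It assumes the display reads as the logarithm of $q_\theta(x_U)\, q_{\theta'}(x_U)$ divided by the squared sum of $\sqrt{q_\theta q_{\theta'}}$, with $q_\theta(x) = \exp(\sum_{(u,v)} \theta_{uv} x_u x_v)$. The function names and the example parameters are hypothetical and chosen only for illustration.

```python
import itertools
import math


def unnormalized_weight(theta, x):
    """q_theta(x) = exp(sum over edges of theta_uv * x_u * x_v)."""
    return math.exp(sum(w * x[u] * x[v] for (u, v), w in theta.items()))


def conditional_divergence(theta, theta_prime, p, x_U):
    """Brute-force J_{x_U}(theta || theta') over p binary variables.

    theta, theta_prime: dicts mapping an edge (u, v) to its weight.
    x_U: dict {node: +1 or -1} fixing the coordinates in U.
    """
    free = [i for i in range(p) if i not in x_U]
    sum_q = sum_q_prime = sum_beta = 0.0
    # Enumerate every configuration that agrees with x_U on U.
    for values in itertools.product((-1, 1), repeat=len(free)):
        y = dict(x_U)
        y.update(zip(free, values))
        q = unnormalized_weight(theta, y)
        q_prime = unnormalized_weight(theta_prime, y)
        sum_q += q
        sum_q_prime += q_prime
        sum_beta += math.sqrt(q * q_prime)
    return math.log(sum_q * sum_q_prime / sum_beta ** 2)


# Hypothetical example: edge (0, 1) belongs to E but not to E'.
theta = {(0, 1): 0.8, (1, 2): -0.5}
theta_prime = {(1, 2): -0.5, (2, 3): 0.3}
J = conditional_divergence(theta, theta_prime, p=4, x_U={2: 1})
```

By the Cauchy-Schwarz inequality the returned value is nonnegative, and it vanishes when the two parameter vectors coincide.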

Let us introduce some additional shorthand so as to lighten notation in the remainder of the 
proof. First we define $\beta(x) := \sqrt{q_\theta(x)\, q_{\theta'}(x)}$, as well as $\alpha(x) := \sqrt{q_\theta(x)/q_{\theta'}(x)}$. Now observe that 
$\alpha(x) = \exp(\Delta(x)/2)$, where $\Delta(x) := \sum_{(s,t) \in E} \theta_{st} x_s x_t - \sum_{(s,t) \in E'} \theta'_{st} x_s x_t$. Lemma 8 
in Appendix B characterizes the behavior of $\Delta(x)$ under changes to $x$. Finally, we define the set 

$$\mathcal{Y}(x_U) := \bigl\{ y \in \{-1,+1\}^p \mid y_i = x_i \text{ for all } i \in U \bigr\},$$

corresponding to the subset of configurations $y \in \{-1,+1\}^p$ that agree with $x_U$ over the subset $U$. 
From the definitions of $\alpha$ and $\beta$, we observe that 

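$$\begin{aligned}
\Bigl[\sum_{y \in \mathcal{Y}(x_U)} q_\theta(y)\Bigr] \Bigl[\sum_{y \in \mathcal{Y}(x_U)} q_{\theta'}(y)\Bigr]
  &= \Bigl[\sum_{y \in \mathcal{Y}(x_U)} \beta(y)\alpha(y)\Bigr] \Bigl[\sum_{y \in \mathcal{Y}(x_U)} \frac{\beta(y)}{\alpha(y)}\Bigr] \\
  &< (1+\epsilon)^2 \Bigl[\sum_{y \in \mathcal{Y}(x_U)} \beta(y)\Bigr]^2,
\end{aligned} \tag{53}$$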



where the inequality follows from the fact $\epsilon < 2$, our original assumption (52), and the elementary 
relations 

$$\exp(z) < 1 + 2z < (1 + 2z)^2 \quad \text{for all } z \in (0,1].$$
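
To spell out how these facts combine into (53), here is a short derivation sketch. It assumes that the divergence is the logarithm of the ratio between the product of the two sums and the squared sum of $\beta$, and it takes $z = \epsilon/2$, which lies in $(0,1]$ because $\epsilon < 2$.

```latex
\begin{align*}
\frac{\bigl[\sum_{y \in \mathcal{Y}(x_U)} q_\theta(y)\bigr]\,
      \bigl[\sum_{y \in \mathcal{Y}(x_U)} q_{\theta'}(y)\bigr]}
     {\bigl[\sum_{y \in \mathcal{Y}(x_U)} \beta(y)\bigr]^2}
  &= \exp\bigl(J_{x_U}(\theta \,\|\, \theta')\bigr) \\
  &< \exp(\epsilon/2) && \text{by assumption (52)} \\
  &< 1 + \epsilon     && \text{since } \exp(z) < 1 + 2z \text{ at } z = \epsilon/2 \\
  &< (1 + \epsilon)^2 && \text{since } 1 + \epsilon > 1 .
\end{align*}
```
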
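Now consider the set of quadratics in $t$, one for each $y \in \mathcal{Y}(x_U)$, given by 

$$\beta(y)\alpha(y) - 2(1+\epsilon)\beta(y)\, t + \frac{\beta(y)}{\alpha(y)}\, t^2 = 0.$$

Summing these quadratic equations over $y \in \mathcal{Y}(x_U)$ yields 

$$q(t) := \sum_{y \in \mathcal{Y}(x_U)} \beta(y)\alpha(y) - 2t(1+\epsilon) \sum_{y \in \mathcal{Y}(x_U)} \beta(y) + t^2 \sum_{y \in \mathcal{Y}(x_U)} \frac{\beta(y)}{\alpha(y)} = 0,$$

which by equation (53) must have two real roots. 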

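Let $t^*$ denote the value of $t$ at which $q(\cdot)$ achieves its minimum. By the quadratic formula, we 
have 

$$t^* = \frac{(1+\epsilon) \sum_{y \in \mathcal{Y}(x_U)} \beta(y)}{\sum_{y \in \mathcal{Y}(x_U)} \beta(y)/\alpha(y)} > 0.$$

Since $q(t^*) < 0$, we obtain 

$$2\epsilon t^* \sum_{y \in \mathcal{Y}(x_U)} \beta(y) > \sum_{y \in \mathcal{Y}(x_U)} \beta(y)\, t^* \Bigl[\sqrt{t^*/\alpha(y)} - \sqrt{\alpha(y)/t^*}\Bigr]^2. \tag{54}$$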
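
The step from $q(t^*) < 0$ to (54) rests on expanding the square. The following sketch fills in that algebra, under the reading that the bracket in (54) is $\sqrt{t^*/\alpha(y)} - \sqrt{\alpha(y)/t^*}$ and that $q$ is the summed quadratic defined above.

```latex
\begin{align*}
\beta(y)\, t^* \Bigl[\sqrt{t^*/\alpha(y)} - \sqrt{\alpha(y)/t^*}\Bigr]^2
  &= \beta(y)\, t^* \Bigl[\frac{t^*}{\alpha(y)} - 2 + \frac{\alpha(y)}{t^*}\Bigr] \\
  &= \frac{\beta(y)}{\alpha(y)}\,(t^*)^2 - 2\beta(y)\, t^* + \beta(y)\alpha(y), \\
\sum_{y \in \mathcal{Y}(x_U)} \beta(y)\, t^*
  \Bigl[\sqrt{t^*/\alpha(y)} - \sqrt{\alpha(y)/t^*}\Bigr]^2
  &= q(t^*) + 2\epsilon\, t^* \sum_{y \in \mathcal{Y}(x_U)} \beta(y) \\
  &< 2\epsilon\, t^* \sum_{y \in \mathcal{Y}(x_U)} \beta(y).
\end{align*}
```
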

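Using the notation 

$$\mathcal{Y}^*(x_U) := \bigl\{ y \in \mathcal{Y}(x_U) \mid \max\{t^*/\alpha(y),\, \alpha(y)/t^*\} < \exp(|\theta_{st}|/2) \bigr\},$$

we can rewrite equation (54) as 

$$\begin{aligned}
2\epsilon t^* \sum_{y \in \mathcal{Y}^*(x_U)} \beta(y)
  &> \sum_{y \notin \mathcal{Y}^*(x_U)} \beta(y)\, t^* \Bigl\{ \Bigl[\sqrt{t^*/\alpha(y)} - \sqrt{\alpha(y)/t^*}\Bigr]^2 - 2\epsilon \Bigr\} \\
  &\overset{(a)}{\ge} 6\epsilon t^* \exp(2\omega) \sum_{y \notin \mathcal{Y}^*(x_U)} \beta(y),
\end{aligned} \tag{55}$$

where inequality (a) follows from the definition of $\mathcal{Y}^*(x_U)$, the monotonically increasing nature of 
the function $f(s) = (s - 1/s)^2$ for $s > 1$, and the relation (51). 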

From Lemma 8, for each $y \in \mathcal{Y}^*(x_U)$, we obtain a configuration $z \notin \mathcal{Y}^*(x_U)$ by flipping either 
$y_s$, $y_t$, or both. Note that at most three configurations $y \in \mathcal{Y}^*(x_U)$ can yield the same configuration 
$z \notin \mathcal{Y}^*(x_U)$. Since these flips do not decrease $\beta(y)$ by more than a factor of $\exp(2\omega)$, we conclude 
that 

$$\sum_{y \in \mathcal{Y}^*(x_U)} \beta(y) \le 3\exp(2\omega) \sum_{z \notin \mathcal{Y}^*(x_U)} \beta(z),$$

which is a contradiction of equation (55). Hence the quadratic $q(\cdot)$ cannot have two real roots, 
which contradicts our initial assumption (52). 

□ 



B Proof of a flipping lemma 

It remains to state and prove a lemma that we exploited in the proof of Lemma 7 from Appendix A. 

Lemma 8. Consider distinct models $\theta$ and $\theta'$, and for each $x \in \{-1,+1\}^p$, define 

$$\Delta(x) := \sum_{(u,v) \in E} \theta_{uv} x_u x_v - \sum_{(u,v) \in E'} \theta'_{uv} x_u x_v. \tag{56}$$

Then for any edge $(s,t) \in E \setminus E'$ and for any configuration $x \in \{-1,+1\}^p$, flipping either $x_s$ or $x_t$ 
(or both) changes $\Delta(x)$ by at least $|\theta_{st}|$. 

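Proof. We use $N(s)$ and $N'(s)$ to denote the neighborhood sets of $s$ in the graphs $G = (V, E)$ and 
$G' = (V, E')$ respectively, with analogous notation for the sets $N(t)$ and $N'(t)$. We then define 

$$\delta_s(x) := \sum_{u \in N(s) \setminus N'(s)} \theta_{su} x_u, \quad \text{and} \quad \delta'_s(x) := \sum_{u \in N'(s) \setminus N(s)} \theta'_{su} x_u,$$

with analogous definitions for the quantities $\delta_t(x)$ and $\delta'_t(x)$. Similarly, we define 

$$\gamma_s(x) := \sum_{u \in N(s) \cap N'(s)} \theta_{su} x_u, \quad \text{and} \quad \gamma_t(x) := \sum_{v \in N(t) \cap N'(t)} \theta_{tv} x_v.$$

Now, let the contribution to the first (respectively second) term of $\Delta(x)$ not involving $s$ and $t$ be 
$\mu(x)$ (respectively $\mu'(x)$), namely 

$$\mu(x) := \sum_{(u,v) \in E,\; u,v \notin \{s,t\}} \theta_{uv} x_u x_v, \quad \text{and} \quad \mu'(x) := \sum_{(u,v) \in E',\; u,v \notin \{s,t\}} \theta'_{uv} x_u x_v.$$
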
With this notation, we first proceed via proof by contradiction to show that $\Delta(x)$ must change 
when $(x_s,x_t)$ are flipped: to the contrary, suppose that $\Delta(x)$ stays fixed for all four choices 
$(x_s,x_t) \in \{-1,+1\}^2$. We then show that this assumption implies that $\theta_{st} = 0$. Note that both of 
the terms $\delta_s(x)$ and $\delta_t(x)$ include a contribution from the edge $(s,t)$. When $(x_s,x_t) = (+1,+1)$, 
we have 

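$$(\delta_s(x) - \theta_{st}) + (\delta_t(x) - \theta_{st}) + \theta_{st} + \mu(x) + \gamma_s(x) + \gamma_t(x) = \delta'_s(x) + \delta'_t(x) + \mu'(x) + \gamma_s(x) + \gamma_t(x) + \Delta(x),$$

whereas when $(x_s,x_t) = (-1,-1)$, we have 

$$-(\delta_s(x) + \theta_{st}) - (\delta_t(x) + \theta_{st}) + \theta_{st} + \mu(x) - \gamma_s(x) - \gamma_t(x) = -\delta'_s(x) - \delta'_t(x) + \mu'(x) - \gamma_s(x) - \gamma_t(x) + \Delta(x).$$

Adding these two equations together yields the equality 

$$\mu(x) - \theta_{st} = \mu'(x) + \Delta(x). \tag{57}$$

On the other hand, for $(x_s,x_t) = (-1,+1)$, we have 

$$-(\delta_s(x) - \theta_{st}) + (\delta_t(x) + \theta_{st}) - \theta_{st} + \mu(x) - \gamma_s(x) + \gamma_t(x) = -\delta'_s(x) + \delta'_t(x) + \mu'(x) - \gamma_s(x) + \gamma_t(x) + \Delta(x),$$

and for $(x_s,x_t) = (+1,-1)$, we have 

$$(\delta_s(x) + \theta_{st}) - (\delta_t(x) - \theta_{st}) - \theta_{st} + \mu(x) + \gamma_s(x) - \gamma_t(x) = \delta'_s(x) - \delta'_t(x) + \mu'(x) + \gamma_s(x) - \gamma_t(x) + \Delta(x).$$

Adding together these two equations yields 

$$\mu(x) + \theta_{st} = \mu'(x) + \Delta(x). \tag{58}$$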

Note that equations (57) and (58) cannot hold simultaneously unless $\theta_{st} = 0$, which implies that 
our initial assumption (namely, that $\Delta(x)$ does not change as we vary $(x_s,x_t) \in \{-1,+1\}^2$) was 
false. 

Finally, we show that the change in $\Delta(x)$ must be at least $|\theta_{st}|$. For each pair $(i,j) \in \{-1,+1\}^2$, 
let $\ell_{ij} = \Delta(x \mid x_s = i, x_t = j)$ be the value of $\Delta(x)$ when $x_s = i$ and $x_t = j$. Suppose that for 
some constant $c$ and $\epsilon > 0$, we have $\ell_{ij} \in [c - \epsilon, c + \epsilon]$ for all $(i,j)$. By following the same reasoning 
as above, we obtain the inequalities $\mu(x) - \theta_{st} \ge \mu'(x) + c - \epsilon$ and $\mu(x) + \theta_{st} \le \mu'(x) + c + \epsilon$, which 
together imply that $\theta_{st} \le \epsilon$. In a similar manner, we obtain the inequalities $\mu(x) + \theta_{st} \ge \mu'(x) + c - \epsilon$ 
and $\mu(x) - \theta_{st} \le \mu'(x) + c + \epsilon$, which imply that $-\theta_{st} \le \epsilon$, thereby completing the proof. 

□ 
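
The statement of Lemma 8 can also be checked numerically on small instances. The sketch below is a brute-force check, not part of the proof: it reads the lemma as saying that, for every configuration, at least one of the three flips (of $x_s$, of $x_t$, or of both) changes $\Delta(x)$ by at least $|\theta_{st}|$. The helper names, the graph size, and the random parameter ranges are illustrative assumptions.

```python
import itertools
import random


def delta(theta, theta_prime, x):
    """Delta(x) as in equation (56): difference of the two pairwise sums."""
    first = sum(w * x[u] * x[v] for (u, v), w in theta.items())
    second = sum(w * x[u] * x[v] for (u, v), w in theta_prime.items())
    return first - second


def smallest_best_flip_change(theta, theta_prime, p, s, t):
    """Min over x of the largest |Delta| change from flipping x_s, x_t, or both."""
    worst = float("inf")
    for x in itertools.product((-1, 1), repeat=p):
        base = delta(theta, theta_prime, x)
        changes = []
        for flip_s, flip_t in ((True, False), (False, True), (True, True)):
            y = list(x)
            if flip_s:
                y[s] = -y[s]
            if flip_t:
                y[t] = -y[t]
            changes.append(abs(delta(theta, theta_prime, y) - base))
        worst = min(worst, max(changes))
    return worst


def random_model(p, rng, edge_prob=0.5):
    """Hypothetical random model: each edge present with probability edge_prob."""
    return {(u, v): rng.uniform(-1.0, 1.0)
            for u, v in itertools.combinations(range(p), 2)
            if rng.random() < edge_prob}


rng = random.Random(0)
p, s, t = 5, 0, 1
results = []
for _ in range(20):
    theta = random_model(p, rng)
    theta_prime = random_model(p, rng)
    # Force the edge (s, t) to lie in E but not in E'.
    theta[(s, t)] = rng.choice((-1, 1)) * rng.uniform(0.1, 1.0)
    theta_prime.pop((s, t), None)
    results.append((smallest_best_flip_change(theta, theta_prime, p, s, t),
                    abs(theta[(s, t)])))
```

Each entry of `results` pairs the smallest guaranteed change with $|\theta_{st}|$ for one random instance; under the reading above, the first component should never fall below the second.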